Preface



The proceedings of ECML/PKDD 2004 are published in two separate, albeit
intertwined, volumes: the Proceedings of the 15th European Conference on
Machine Learning (LNAI 3201) and the Proceedings of the 8th European Conference
on Principles and Practice of Knowledge Discovery in Databases (LNAI 3202).
The two conferences were co-located in Pisa, Tuscany, Italy, during September
20-24, 2004.

It was the fourth time in a row that ECML and PKDD were co-located. Af- 
ter the successful co-locations in Freiburg (2001), Helsinki (2002), and Cavtat- 
Dubrovnik (2003), it became clear that researchers strongly supported the orga- 
nization of a major scientific event about machine learning and data mining in 
Europe. 

We are happy to provide some statistics about the conferences. 581 different 
papers were submitted to ECML/PKDD (about a 75% increase over 2003); 280 
were submitted to ECML 2004 only, 194 were submitted to PKDD 2004 only, and 
107 were submitted to both. Around half of the authors for submitted papers are 
from outside Europe, which is a clear indicator of the increasing attractiveness 
of ECML/PKDD. 

The Program Committee members were deeply involved in what turned out 
to be a highly competitive selection process. We assigned each paper to 3 re- 
viewers, deciding on the appropriate PC for papers submitted to both ECML 
and PKDD. As a result, ECML PC members reviewed 312 papers and PKDD 
PC members reviewed 269 papers. We accepted for publication regular papers 
(45 for ECML 2004 and 39 for PKDD 2004) and short papers that were asso- 
ciated with poster presentations (6 for ECML 2004 and 9 for PKDD 2004). The 
global acceptance rate was 14.5% for regular papers (17% if we include the short 
papers). 

The scientific program of ECML/PKDD 2004 also included 5 invited talks, a 
wide workshop and tutorial program (10 workshops plus a Discovery Challenge 
workshop, and seven tutorials) and a demo session. 

We wish to express our gratitude to: 

— the authors of the submitted papers; 

— the program committee members and the additional referees for their excep- 
tional contribution to a tough but crucial selection process; 

— the invited speakers: Dimitris Achlioptas (Microsoft Research, Redmond),
Rakesh Agrawal (IBM Almaden Research Center), Soumen Chakrabarti (Indian
Institute of Technology, Bombay), Pedro Domingos (University of
Washington, Seattle), and David J. Hand (Imperial College, London);

— the workshop chairs Donato Malerba and Mohammed J. Zaki; 

— the tutorial chairs Katharina Morik and Franco Turini; 

— the discovery challenge chairs Petr Berka and Bruno Cremilleux; 
— the publicity chair Salvatore Ruggieri;

— the demonstration chairs Rosa Meo, Elena Baralis, and Codrina Lauth; 

— the members of the ECML/PKDD Steering Committee Peter Flach, Luc De 
Raedt, Arno Siebes, Nada Lavrac, Dragan Gamberger, Ljupco Todorovski, 
Hendrik Blockeel, Tapio Elomaa, Heikki Mannila, and Hannu T.T. Toivonen; 

— the members of the Award Committee, Michael May and Foster Provost; 

— the workshop organizers and the tutorial presenters;

— the extremely efficient Organization Committee members, Maurizio Atzori, 
Miriam Baglioni, Sergio Barsocchi, Jeremy Besson, Francesco Bonchi, Stefa- 
no Ferilli, Tiziana Mazzone, Mirco Nanni, Ruggero Pensa, Simone Puntoni, 
Chiara Renso, Salvatore Rinzivillo, as well as all the other members of the 
KDD Lab in Pisa, Laura Balbarini and Cristina Rosamilia of L&B Studio, 
Elena Perini and Elena Tonsini of the University of Pisa; 

— the great Web masters Mirco Nanni, Chiara Renso and Salvatore Rinzivillo; 

— the directors of the two research institutions in Pisa that jointly made this
event possible, Piero Maestrini (ISTI-CNR) and Ugo Montanari (Dipartimento
di Informatica);

— the administration staff of the two research institutions in Pisa, in
particular Massimiliano Farnesi (ISTI-CNR), Paola Fabiani and Letizia Petrellese
(Dipartimento di Informatica);

— Richard van de Stadt (www.borbala.com) for his efficient support to the 
management of the whole submission and evaluation process by means of 
the CyberChairPRO software; 

— Alfred Hofmann of Springer for co-operation in publishing the proceedings. 

We gratefully acknowledge the financial support of KDNet, the Pascal Net- 
work, Kluwer and the Machine Learning journal, Springer, the Province of Luc- 
ca, the Province of Pisa, the Municipality of Pisa, Microsoft Research, COOP, 
Exeura, Intel, Talent, INSA-Lyon, ISTI-CNR Pisa, the University of Pisa, the 
University of Bari, and the patronage of Regione Toscana. 

There is no doubt that the impressive scientific activities in machine lear- 
ning and data mining world-wide were well demonstrated in Pisa. We had an 
exciting week in Tuscany, enhancing further co-operations between the many re- 
searchers who are pushing knowledge discovery into becoming a mature scientific 
discipline. 



July 2004

Jean-François Boulicaut,
Floriana Esposito,
Fosca Giannotti,
and Dino Pedreschi




ECML/PKDD 2004 Organization 



Executive Committee

Program Chairs
Jean-François Boulicaut (INSA Lyon)
Floriana Esposito (Universita di Bari)
Fosca Giannotti (ISTI-CNR)
Dino Pedreschi (Universita di Pisa)

Workshop Chairs
Donato Malerba (University of Bari)
Mohammed J. Zaki (Rensselaer Polytechnic Institute)

Tutorial Chairs
Katharina Morik (University of Dortmund)
Franco Turini (University of Pisa)

Discovery Challenge Chairs
Petr Berka (University of Economics, Prague)
Bruno Cremilleux (University of Caen)

Publicity Chair
Salvatore Ruggieri (University of Pisa)

Demonstration Chairs
Rosa Meo (University of Turin)
Elena Baralis (Politecnico of Turin)
Ina Lauth (Fraunhofer Institute for Autonomous Intelligent Systems)

Steering Committee
Peter Flach (University of Bristol)
Luc De Raedt (Albert-Ludwigs University, Freiburg)
Arno Siebes (Utrecht University)
Nada Lavrac (Jozef Stefan Institute)
Dragan Gamberger (Rudjer Boskovic Institute)
Ljupco Todorovski (Jozef Stefan Institute)
Hendrik Blockeel (Katholieke Universiteit Leuven)
Tapio Elomaa (Tampere University of Technology)
Heikki Mannila (Helsinki Institute for Information Technology)
Hannu T.T. Toivonen (University of Helsinki)

Awards Committee
Michael May (Fraunhofer Institute for Autonomous Intelligent Systems, KDNet representative)
Floriana Esposito (PC representative)
Foster Provost (Editor-in-Chief of Machine Learning Journal, Kluwer)

Organizing Committee
Maurizio Atzori (KDDLab, ISTI-CNR)
Miriam Baglioni (KDDLab, University of Pisa)
Sergio Barsocchi (KDDLab, ISTI-CNR)
Jeremy Besson (INSA Lyon)
Francesco Bonchi (KDDLab, ISTI-CNR)
Stefano Ferilli (University of Bari)
Tiziana Mazzone (KDDLab)
Mirco Nanni (KDDLab, ISTI-CNR)
Ruggero Pensa (INSA Lyon)
Chiara Renso (KDDLab, ISTI-CNR)
Salvatore Rinzivillo (KDDLab, University of Pisa)







ECML 2004 Program Committee 



Hendrik Blockeel, Belgium 

Marco Botta, Italy 

Henrik Bostrom, Sweden 

Jean-François Boulicaut, France

Ivan Bratko, Slovenia 

Pavel Brazdil, Portugal 

Nello Cristianini, USA 

James Cussens, UK 

Ramon Lopes de Mantaras, Spain 

Luc De Raedt, Germany 

Luc Dehaspe, Belgium 

Jose del R. Millan, Switzerland 

Saso Dzeroski, Slovenia 

Tapio Elomaa, Finland 

Floriana Esposito, Italy 

Peter Flach, UK 

Johannes Fürnkranz, Germany

Joao Gama, Portugal 

Dragan Gamberger, Croatia 

Jean-Gabriel Ganascia, France 

Fosca Giannotti, Italy 

Attilio Giordana, Italy 

Haym Hirsh, USA 

Thomas Hofmann, USA 

Tamas Horvath, Germany 

Thorsten Joachims, USA 

Dimitar Kazakov, UK

Roni Khardon, USA 

Joerg Kindermann, Germany 

Yves Kodratoff, France 

Igor Kononenko, Slovenia 

Stefan Kramer, Germany 

Miroslav Kubat, USA 

Stephen Kwek, USA 

Nada Lavrac, Slovenia 

Charles Ling, Canada 



Donato Malerba, Italy 

Heikki Mannila, Finland 

Stan Matwin, Canada 

Dunja Mladenic, Slovenia 

Katharina Morik, Germany 

Hiroshi Motoda, Japan 

Remi Munos, France 

Richard Nock, France 

David Page, USA 

Georgios Paliouras, Greece 

Dino Pedreschi, Italy 

Bernhard Pfahringer, New Zealand 

Enric Plaza, Spain 

Juho Rousu, UK 

Celine Rouveirol, France 

Tobias Scheffer, Germany 

Michele Sebag, France 

Giovanni Semeraro, Italy 

Arno Siebes, The Netherlands 

Robert Sloan, USA 

Gerd Stumme, Germany 

Henry Tirri, Finland 

Ljupco Todorovski, Slovenia 

Luis Torgo, Portugal 

Peter Turney, Canada 

Maarten van Someren, The Netherlands 

Paul Vitanyi, The Netherlands 

Sholom Weiss, USA 

Dietrich Wettschereck, UK 

Gerhard Widmer, Austria 

Marco Wiering, The Netherlands 

Ruediger Wirth, Germany 

Stefan Wrobel, Germany 

Thomas Zeugmann, Germany 

Tong Zhang, USA 

Blaz Zupan, Slovenia 







PKDD 2004 Program Committee 



Elena Baralis, Italy 

Michael Berthold, Germany

Elisa Bertino, USA 

Hendrik Blockeel, Belgium 

Jean-François Boulicaut, France

Christopher W. Clifton, USA 

Bruno Cremilleux, France 

Luc De Raedt, Germany 

Luc Dehaspe, Belgium 

Saso Dzeroski, Slovenia 

Tapio Elomaa, Finland 

Floriana Esposito, Italy 

Martin Ester, Canada 

Ad Feelders, The Netherlands 

Ronen Feldman, Israel

Peter Flach, UK 

Eibe Frank, New Zealand 

Alex Freitas, UK 

Johannes Fürnkranz, Germany

Dragan Gamberger, Croatia 

Minos Garofalakis, USA 

Fosca Giannotti, Italy 

Christophe Giraud-Carrier, Switzerland 

Bart Goethals, Finland 

Howard Hamilton, Canada 

Robert Hilderman, Canada 

Haym Hirsh, USA 

Frank Hoeppner, Germany 

Se Hong, USA 

Samuel Kaski, Finland 

Daniel Keim, Germany 

Jorg-Uwe Kietz, Switzerland 

Ross King, UK 

Yves Kodratoff, France 

Joost Kok, The Netherlands 

Stefan Kramer, Germany 

Laks Lakshmanan, Canada 

Nada Lavrac, Slovenia 

Donato Malerba, Italy 

Giuseppe Manco, Italy 

Heikki Mannila, Finland 

Stan Matwin, Canada 

Michael May, Germany 



Rosa Meo, Italy 

Dunja Mladenic, Slovenia 

Katharina Morik, Germany 

Shinichi Morishita, Japan

Hiroshi Motoda, Japan 

Gholamreza Nakhaeizadeh, Germany 

Claire Nedellec, France 

David Page, USA 

Dino Pedreschi, Italy 

Zbigniew Ras, USA 

Jan Rauch, Czech Republic

Christophe Rigotti, France 

Gilbert Ritschard, Switzerland 

John Roddick, Australia 

Yucel Saygin, Turkey 

Michele Sebag, France 

Marc Sebban, France 

Arno Siebes, The Netherlands 

Andrzej Skowron, Poland 

Myra Spiliopoulou, Germany 

Nicolas Spyratos, France 

Reinhard Stolle, USA 

Gerd Stumme, Germany 

Einoshin Suzuki, Japan 

Ah-Hwee Tan, Singapore 

Ljupco Todorovski, Slovenia 

Hannu Toivonen, Finland 

Luis Torgo, Portugal 

Shusaku Tsumoto, Japan 

Franco Turini, Italy 

Maarten van Someren, The Netherlands 

Ke Wang, Canada 

Louis Wehenkel, Belgium 

Dietrich Wettschereck, UK 

Gerhard Widmer, Austria 

Ruediger Wirth, Germany 

Stefan Wrobel, Germany 

Osmar R. Zaiane, Canada 

Mohammed Zaki, USA 

Carlo Zaniolo, USA 

Djamel Zighed, France 

Blaz Zupan, Slovenia 







ECML/PKDD 2004 Additional Reviewers 



Fabio Abbattista 
Markus Ackermann 
Erick Alphonse 
Oronzo Altamura 
Massih Amini
Ahmed Amrani 
Anastasia Analiti 
Nicos Angelopoulos 
Fabrizio Angiulli 
Luiza Antonie 
Annalisa Appice 
Josep-Lluis Arcos 
Eva Armengol 
Thierry Artieres 
Maurizio Atzori 
Anne Auger 
Ilkka Autio 
Jerome Aze 
Vincenzo Bacarella 
Miriam Baglioni 
Yijian Bai 
Cristina Baroglio 
Teresa Basile 
Ganesan Bathumalai 
Fadila Bentayeb 
Margherita Berardi 
Bettina Berendt 
Petr Berka 
Guillaume Beslon 
Philippe Bessieres 
Matjaz Bevk 
Steffen Bickel 
Gilles Bisson 
Avrim Blum 
Axel Blumenstock 
Damjan Bojadziev 
Francesco Bonchi 
Toufik Boudellal 
Omar Boussaid 
Janez Brank
Nicolas Bredeche 
Ulf Brefeld 
Wray Buntine 
Christoph Büscher
Benjamin Bustos 
Niccolo Capanni 
Amedeo Cappelli 



Martin R.J. Carpenter 
Costantina Caruso 
Ciro Castiello 
Barbara Catania 
Davide Cavagnino 
Michelangelo Ceci 
Alessio Ceroni 
Jesus Cerquides 
Eugenio Cesario 
Silvia Chiusano 
Fang Chu 
Antoine Cornuejols 
Fabrizio Costa 
Gianni Costa 
Tom Croonenborghs 
Tomaz Curk 
Maria Damiani 
Agnieszka Dardzinska 
Tijl De Bie 
Edwin D. De Jong 
Kurt De Grave 
Marco Degemmis 
Janez Demsar 
Damjan Demsar 
Michel de Rougemont 
Nicola Di Mauro 
Christos Dimitrakakis 
Simon Dixon 
Kurt Driessens 
Isabel Drost 
Chris Drummond 
Wenliang Du 
Nicolas Durand 
Michael Egmont-Petersen 
Craig Eldershaw 
Mohammed El-Hajj 
Roberto Esposito 
Timm Euler 
Theodoros Evgeniou
Anna Maria Fanelli 
Nicola Fanizzi 
Ayman Farahat 
Sebastien Ferre 
Stefano Ferilli 
Daan Fierens 
Thomas Finley 
Sergio Flesca 



Francois Fleuret 
Francesco Folino 
Francesco Fornasari 
Blaz Fortuna 
Andrew Foss 
Keith Frikken 
Barbara Furletti 
Thomas Gartner 
Ugo Galassi 
Arianna Gallo 
Byron Gao 
Paolo Garza 
Liqiang Geng 
Claudio Gentile 
Pierre Geurts 
Zoubin Ghahramani 
Arnaud Giacometti 
Emiliano Giovannetti 
Piotr Gmytrasiewicz 
Judy Goldsmith 
Anna Gomolinska 
Udo Grimmer 
Matthew Grounds 
Antonella Guzzo 
Amaury Habrard 
Stephan ten Hagen 
Jorg Hakenberg 
Mark Hall 
Greg Hamerly 
Ji He 

Jaana Heino 
Thomas Heitz 
Frank Herrmann 
Haitham Hindi 
Ayca Azgin Hintoglu 
Joachim Hipp 
Susanne Hoche 
Pieter Jan ’t Hoen 
Andreas Hotho 
Tomas Hrycej 
Luigi Iannone 
Inaki Inza 
François Jacquenet
Aleks Jakulin
Jean-Christophe Janodet
Nathalie Japkowicz 
Tony Jebara 







Tao-Yuan Jen 
Tao Jiang 
Xing Jiang 
Yuelong Jiang 
Alipio Jorge 
Pierre-Emmanuel Jouve 
Matti Kaariainen 
Spiros Kapetanakis 
Vangelis Karkaletsis 
Andreas Karwath 
Branko Kavsek 
Steffen Kempe 
Kristian Kersting 
Jahwan Kim 
Minsoo Kim 
Svetlana Kiritchenko 
Richard Kirkby 
Jyrki Kivinen 
Willi Kloesgen 
Gabriella Kokai 
Petri Kontkanen 
Dimitrios Kosmopoulos 
Mark-A. Krogel 
Jussi Kujala 
Matjaz Kukar 
Kari Laasonen 
Krista Lagus 
Lotfi Lakhal 
Stephane Lallich 
Gert Lanckriet 
John Langford 
Carsten Lanquillon 
Antonietta Lanza 
Michele Lapi 
Dominique Laurent 
Yan-Nei Law 
Neil Lawrence 
Gregor Leban 
Sau Dan Lee 
Gaelle Legrand 
Edda Leopold 
Claire Leschi 
Guichong Li 
Oriana Licchelli 
Per Liden 
Jussi T. Lindgren 
Francesca A. Lisi 
Bing Liu 
Zhenyu Liu 
Peter Ljubic 



Marco Locatelli 

Huma Lodhi

Ricardo Lopes 

Pasquale Lops 

Robert Lothian 

Claudio Lucchese 

Jack Lutz 

Tuomo Malinen 

Michael Maltrud 

Suresh Manandhar 

Alain-Pierre Manine 

Raphael Maree 

Berardi Margherita 

Elio Masciari 

Cyrille Masson 

Nicolas Meger 

Carlo Meghini 

Corrado Mencar 

Amar-Djalil Mezaour 

Tatiana Miazhynskaia 

Alessio Micheli 

Taneli Mielikainen 

Ingo Mierswa 

Tommi Mononen 

Martin Mozina 

Thierry Murgue 

Mirco Nanni 

Phu Chien Nguyen 

Tuan Trung Nguyen 

Alexandra Niculescu-Mizil 

Siegfried Nijssen 

Janne Nikkila 

Blaz Novak 

Alexandros Ntoulas 

William O’Neill 

Kouzou Ohara 

Arlindo L. Oliveira 

Santiago Ontanon 

Riccardo Ortale 

Martijn van Otterlo 

Gerhard Paass 

Ignazio Palmisano 

Christian Panse 

Andrea Passerini 

Jaakko Peltonen 

Lourdes Pena 

Raffaele Perego 

Jose Ramon Quevedo Perez 

Fernando Perez-Cruz 

Georgios Petasis 



Johann Petrak 
Sergios Petridis 
Viet Phan-Luong 
Dimitris Pierrakos 
Joel Plisson 
Neoklis Polyzotis 
Lubos Popelinsky
Roland Priemer 
Kai Puolamaki 
Sabine Rabaseda 
Filip Radlinski 
Mika Raento 
Jan Ramon 
Ari Rantanen 
Pierre Renaux 
Chiara Renso 
Rita Ribeiro 
Lothar Richter 
Salvatore Rinzivillo 
Francois Rioult 
Stefano Rizzi 
Celine Robardet 
Mathieu Roche 
Pedro Rodrigues 
Teemu Roos 
Benjamin Rosenfeld 
Roman Rosipal 
Fabrice Rossi 
Olga Roudenko 
Antonin Rozsypal 
Ulrich Rückert
Salvatore Ruggieri
Stefan Rüping
Nicolas Sabouret 
Aleksander Sadikov 
Taro L. Saito 
Lorenza Saitta 
Luka Sajn 
Apkar Salatian 
Marko Salmenkivi 
Craig Saunders 
Alexandr Savinov 
Jelber Sayyad Shirabad 
Francesco Scarcello 
Christoph Schmitz 
Joern Schneidewind 
Martin Scholz 
Tobias Schreck 
Ingo Schwab 
Mihaela Scuturici 







Vasile-Marian Scuturici 
Alexander K. Seewald 
Jouni K. Seppanen 
Jun Sese 
Georgios Sigletos 
Marko Robnik-Sikonja 
Fabrizio Silvestri 
Janne Sinkkonen 
Mike Sips 
Dominik Slezak 
Giovanni Soda 
Larisa Soldatova 
Arnaud Soulet 
Alessandro Sperduti 
Jaroslaw Stepaniuk 
Olga Stepankova 
Umberto Straccia 
Alexander L. Strehl 
Thomas Strohmann
Jan Struyf
Dorian Suc

Henri-Maxime Suchier 
Johan Suykens 
Piotr Synak 
Marcin Szczuka 



Prasad Tadepalli 
Andrea Tagarelli 
Julien Tane 
Alexandre Termier 
Evimaria Terzi 
Franck Thollard 
Andrea Torsello 
Alain Trubuil 
Athanasios Tsakonas 
Chrisa Tsinaraki 
Ville Tuulos 
Yannis Tzitzikas 
Jaideep S. Vaidya 
Pascal Vaillant 
Alexandros Valarakos 
Anneleen Van Assche 
Antonio Varlaro 
Guillaume Vauvert 
Julien Velcin 
Celine Vens 
Naval K. Verma 
Ricardo Vilalta 
Alexei Vinokourov 
Daniel Vladusic 
Nikos Vlassis 



Alessandro Vullo 
Bernard Zenko 
Martin Znidarsic 
Haixun Wang 
Xin Wang 
Yizhou Wang 
Hannes Wettig 
Nirmalie Wiratunga 
Jakub Wroblewski 
Michael Wurst 
Dan Xiao 
Tomoyuki Yamada 
Robert J. Yan 
Hong Yao 
Ghim-Eng Yap 
Kihoon Yoon 
Bianca Zadrozny 
Fabio Zambetta 
Farida Zehraoui 
Bernard Zenko 
Xiang Zhang 
Alexander Zien 
Albrecht Zimmermann 







ECML/PKDD 2004 Tutorials 

Evaluation in Web Mining 

Bettina Berendt, Ernestina Menasalvas, Myra Spiliopoulou 

Symbolic Data Analysis 
Edwin Diday, Carlos Marcelo 

Radial Basis Functions: An Algebraic Approach (with Data Mining 
Applications) 

Amrit L. Goel, Miyoung Shin 

Mining Unstructured Data 
Ronen Feldman 

Statistical Approaches Used in Machine Learning 
Bruno Apolloni, Dario Malchiodi 

Rule-Based Data Mining Methods for Classification Problems in the 
Biomedical Domain 
Jinyan Li, Limsoon Wong 

Distributed Data Mining for Sensor Networks 
Hillol Kargupta 



ECML/PKDD 2004 Workshops 

Statistical Approaches for Web Mining (SAWM) 

Marco Gori, Michelangelo Ceci, Mirco Nanni 

Symbolic and Spatial Data Analysis: Mining Complex Data Structures 
Paula Brito, Monique Noirhomme 

Third International Workshop on Knowledge Discovery in Inductive 
Databases (KDID 2004) 

Bart Goethals, Arno Siebes

Data Mining and Adaptive Modelling Methods for Economics and 
Management (IWAMEM 2004) 

Pavel Brazdil, Fernando S. Oliveira, Giulio Bottazzi 

Privacy and Security Issues in Data Mining 
Yücel Saygin







Knowledge Discovery and Ontologies 

Paul Buitelaar, Jurgen Franke, Marko Grobelnik, Gerhard Paass, Vojtech Svatek

Mining Graphs, Trees and Sequences (MGTS 2004) 

Joost Kok, Takashi Washio 

Advances in Inductive Rule Learning 
Johannes Fürnkranz

Data Mining and Text Mining for Bioinformatics 
Tobias Scheffer 

Knowledge Discovery in Data Streams 
Jesus Aguilar-Ruiz, Joao Gama 




Table of Contents 



Invited Papers 

Random Matrices in Data Analysis
Dimitris Achlioptas

Data Privacy
Rakesh Agrawal

Breaking Through the Syntax Barrier: Searching with Entities and Relations
Soumen Chakrabarti

Real-World Learning with Markov Logic Networks
Pedro Domingos

Strength in Diversity: The Advance of Data Analysis
David J. Hand

Contributed Papers

Mining Positive and Negative Association Rules: An Approach for Confined Rules
Maria-Luiza Antonie and Osmar R. Zaiane

An Experiment on Knowledge Discovery in Chemical Databases
Sandra Berasaluce, Claude Laurenço, Amedeo Napoli, and Gilles Niel

Shape and Size Regularization in Expectation Maximization and Fuzzy Clustering
Christian Borgelt and Rudolf Kruse

Combining Multiple Clustering Systems
Constantinos Boulis and Mari Ostendorf

Reducing Data Stream Sliding Windows by Cyclic Tree-Like Histograms
Francesco Buccafurri and Gianluca Lax

A Framework for Data Mining Pattern Management
Barbara Catania, Anna Maddalena, Maurizio Mazza, Elisa Bertino, and Stefano Rizzi

Spatial Associative Classification at Different Levels of Granularity: A Probabilistic Approach
Michelangelo Ceci, Annalisa Appice, and Donato Malerba







AutoPart: Parameter-Free Graph Partitioning and Outlier Detection
Deepayan Chakrabarti

Properties and Benefits of Calibrated Classifiers
Ira Cohen and Moises Goldszmidt

A Tree-Based Approach to Clustering XML Documents by Structure
Gianni Costa, Giuseppe Manco, Riccardo Ortale, and Andrea Tagarelli

Discovery of Regulatory Connections in Microarray Data
Michael Egmont-Petersen, Wim de Jonge, and Arno Siebes

Learning from Little: Comparison of Classifiers Given Little Training
George Forman and Ira Cohen

Geometric and Combinatorial Tiles in 0-1 Data
Aristides Gionis, Heikki Mannila, and Jouni K. Seppanen

Document Classification Through Interactive Supervision of Document and Term Labels
Shantanu Godbole, Abhay Harpale, Sunita Sarawagi, and Soumen Chakrabarti

Classifying Protein Fingerprints
Melanie Hilario, Alex Mitchell, Jee-Hyub Kim, Paul Bradley, and Terri Attwood

Finding Interesting Pass Patterns from Soccer Game Records
Shoji Hirano and Shusaku Tsumoto

Discovering Unexpected Information for Technology Watch
François Jacquenet and Christine Largeron

Scalable Density-Based Distributed Clustering
Eshref Januzaj, Hans-Peter Kriegel, and Martin Pfeifle

Summarization of Dynamic Content in Web Collections
Adam Jatowt and Mitsuru Ishizuka

Mining Thick Skylines over Large Databases
Wen Jin, Jiawei Han, and Martin Ester

Ensemble Feature Ranking
Kees Jong, Jeremie Mary, Antoine Cornuejols, Elena Marchiori, and Michele Sebag

Privately Computing a Distributed k-nn Classifier
Murat Kantarcioglu and Chris Clifton

Incremental Nonlinear PCA for Classification
Byung Joo Kim and Il Kon Kim







A Spectroscopy of Texts for Effective Clustering
Wenyuan Li, Wee-Keong Ng, Kok-Leong Ong, and Ee-Peng Lim

Constraint-Based Mining of Episode Rules and Optimal Window Sizes
Nicolas Meger and Christophe Rigotti

Analysing Customer Churn in Insurance Data - A Case Study
Katharina Morik and Hanna Kopcke

Nomograms for Visualization of Naive Bayesian Classifier
Martin Mozina, Janez Demsar, Michael Kattan, and Blaz Zupan

Using a Hash-Based Method for Apriori-Based Graph Mining
Phu Chien Nguyen, Takashi Washio, Kouzou Ohara, and Hiroshi Motoda

Evaluation of Rule Interestingness Measures with a Clinical Dataset on Hepatitis
Miho Ohsaki, Shinya Kitaguchi, Kazuya Okamoto, Hideto Yokoi, and Takahira Yamaguchi

Classification in Geographical Information Systems
Salvatore Rinzivillo and Franco Turini

Digging into Acceptor Splice Site Prediction: An Iterative Feature Selection Approach
Yvan Saeys, Sven Degroeve, and Yves Van de Peer

Itemset Classified Clustering
Jun Sese and Shinichi Morishita

Combining Winnow and Orthogonal Sparse Bigrams for Incremental Spam Filtering
Christian Siefkes, Fidelis Assis, Shalendra Chhabra, and William S. Yerazunis

Asynchronous and Anticipatory Filter-Stream Based Parallel Algorithm for Frequent Itemset Mining
Adriano Veloso, Wagner Meira Jr., Renato Ferreira, Dorgival Guedes Neto, and Srinivasan Parthasarathy

A Quantification of Cluster Novelty with an Application to Martian Topography
Ricardo Vilalta, Tom Stepinski, Muralikrishna Achari, and Francisco Ocegueda-Hernandez

Density-Based Spatial Clustering in the Presence of Obstacles and Facilitators
Xin Wang, Camilo Rostoker, and Howard J. Hamilton







Text Mining for Finding Functional Community of Related Genes Using TCM Knowledge
Zhaohui Wu, Xuezhong Zhou, Baoyan Liu, and Junli Chen

Dealing with Predictive-but-Unpredictable Attributes in Noisy Data Sources
Ying Yang, Xindong Wu, and Xingquan Zhu

A New Scheme on Privacy Preserving Association Rule Mining
Nan Zhang, Shengquan Wang, and Wei Zhao

Posters

A Unified and Flexible Framework for Comparing Simple and Complex Patterns
Ilaria Bartolini, Paolo Ciaccia, Irene Ntoutsi, Marco Patella, and Yannis Theodoridis

Constructing (Almost) Phylogenetic Trees from Developmental Sequences Data
Ronnie Bathoorn and Arno Siebes

Learning from Multi-source Data
Elisa Fromont, Marie-Odile Cordier, and Rene Quiniou

The Anatomy of SnakeT: A Hierarchical Clustering Engine for Web-Page Snippets
Paolo Ferragina and Antonio Gulli

COCOA: Compressed Continuity Analysis for Temporal Databases
Kuo-Yu Huang, Chia-Hui Chang, and Kuo-Zui Lin

Discovering Interpretable Muscle Activation Patterns with the Temporal Data Mining Method
Fabian Morchen, Alfred Ultsch, and Olaf Hoos

A Tolerance Rough Set Approach to Clustering Web Search Results
Chi Lang Ngo and Hung Son Nguyen

Improving the Performance of the RISE Algorithm
Aloisio Carlos de Pina and Gerson Zaverucha

Mining History of Changes to Web Access Patterns
Qiankun Zhao and Sourav S. Bhowmick

Demonstration Papers

Visual Mining of Spatial Time Series Data
Gennady Andrienko, Natalia Andrienko, and Peter Gatalsky







Detecting Driving Awareness
Bruno Apolloni, Andrea Brega, Dario Malchiodi, and Cristian Mesiano

An Effective Recommender System for Highly Dynamic and Large Web Sites
Ranieri Baraglia, Francesco Merlo, and Fabrizio Silvestri

SemanticTalk: Software for Visualizing Brainstorming Sessions and Thematic Concept Trails on Document Collections
Chris Biemann, Karsten Bohm, Gerhard Heyer, and Ronny Melz

Orange: From Experimental Machine Learning to Interactive Data Mining
Janez Demsar, Blaz Zupan, Gregor Leban, and Tomaz Curk

Terrorist Detection System
Yuval Elovici, Abraham Kandel, Mark Last, Bracha Shapira, Omer Zaafrany, Moti Schneider, and Menahem Friedman

Experimenting SnakeT: A Hierarchical Clustering Engine for Web-Page Snippets
Paolo Ferragina and Antonio Gulli

HIClass: Hyper-interactive Text Classification by Interactive Supervision of Document and Term Labels
Shantanu Godbole, Abhay Harpale, Sunita Sarawagi, and Soumen Chakrabarti

Balios - The Engine for Bayesian Logic Programs
Kristian Kersting and Uwe Dick

SEWeP: A Web Mining System Supporting Semantic Personalization
Stratos Paulakis, Charalampos Lampos, Magdalini Eirinaki, and Michalis Vazirgiannis

SPIN! Data Mining System Based on Component Architecture
Alexandr Savinov

Author Index




Random Matrices in Data Analysis 



Dimitris Achlioptas 

Microsoft Research, Redmond, WA 98052, USA 
optas@microsoft.com



Abstract. We show how carefully crafted random matrices can achieve 
distance-preserving dimensionality reduction, accelerate spectral compu- 
tations, and reduce the sample complexity of certain kernel methods. 



1 Introduction 

Given a collection of n data points (vectors) in high-dimensional Euclidean space 
it is natural to ask whether they can be projected into a lower dimensional 
Euclidean space without suffering great distortion. Two particularly interesting 
classes of projections are: i) projections that tend to preserve the interpoint 
distances, and ii) projections that maximize the average projected vector length. 

In the last few years, distance-preserving projections have had great impact in 
theoretical computer science where they have been useful in a variety of algorith- 
mic settings, such as approximate nearest neighbor search, clustering, learning 
mixtures of distributions, and computing statistics of streamed data. 

The general idea is that by providing a low dimensional representation of the
data, distance-preserving embeddings dramatically speed up algorithms whose
run-time depends exponentially on the dimension of the working space. At the
same time, the provided guarantee regarding pairwise distances often allows one
to show that the solution found by working in the low dimensional space is a
good approximation to the solution in the original space.

Perhaps the most commonly used projections aim at maximizing the average
projected vector length, thus retaining most of the variance in the data. This
involves representing the data as a matrix A, diagonalizing A = UDV, and
projecting A onto subspaces spanned by the vectors in U or V corresponding to the
largest entries in D. Variants of this idea are known as the Karhunen-Loève
transform, Principal Component Analysis, Singular Value Decomposition, and others.

In this paper we examine different applications of random matrices to both
kinds of projections, all stemming from variations of the following basic fact: if
R is an $n \times n$ random matrix whose entries are i.i.d. Normal random variables,
N(0, 1), then the matrix $\frac{1}{\sqrt{n}} R$ is very close to being orthonormal.



2 Euclidean Distance Preservation 

A classic result of Johnson and Lindenstrauss [7] asserts that any set of n points
in $\mathbb{R}^d$ can be embedded into $\mathbb{R}^k$, with $k = O(\log n)$, so that all pairwise distances
are maintained within an arbitrarily small factor. More precisely,

Lemma 1 ([7]). Given $0 < \epsilon < 1$ and an integer n, let k be a positive integer
such that $k \ge k_0 = (12/\epsilon^2) \log n$. For every set P of n points in $\mathbb{R}^d$ there exists
$f : \mathbb{R}^d \to \mathbb{R}^k$ such that for all $u, v \in P$,

$$(1 - \epsilon)\,\|u - v\|^2 \le \|f(u) - f(v)\|^2 \le (1 + \epsilon)\,\|u - v\|^2 .$$
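To get a feel for the bound in Lemma 1, the threshold $k_0$ can be evaluated for a few values of n and $\epsilon$. The sketch below is only illustrative: the function name `jl_target_dimension` is hypothetical, and it assumes that "log" in the lemma is the natural logarithm, which the text does not state.

```python
import math


def jl_target_dimension(n_points: int, eps: float) -> int:
    """Return k_0 = (12 / eps^2) * ln(n_points), rounded up to an integer.

    Assumption: the logarithm in Lemma 1 is taken to be natural.
    """
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie strictly between 0 and 1")
    if n_points < 2:
        raise ValueError("need at least two points")
    return math.ceil(12.0 / eps ** 2 * math.log(n_points))


if __name__ == "__main__":
    # The bound depends on n and eps only, never on the original dimension d.
    for n in (1_000, 1_000_000):
        for eps in (0.5, 0.1):
            print(f"n={n:>9,d}  eps={eps}:  k_0 = {jl_target_dimension(n, eps)}")
```

Under this assumption, squaring the number of points roughly doubles the target dimension, while halving $\epsilon$ quadruples it.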

Perhaps a naive attempt to construct an embedding as above would be to
pick a random set of k coordinates from the original space. Unfortunately, two
points can be very far apart while differing only along one original dimension,
dooming this approach. On the other hand, if (somehow) for all pairs of points,
all coordinates contributed "roughly equally" to their distance, such a sampling
scheme would be very natural. This consideration motivates the following idea: first
apply a random rotation to the n points, and then pick the first k coordinates as
the new coordinates. The random rotation can be viewed as a form of insurance
against axis alignment, analogous to applying a random permutation before
running Quicksort.
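The failure mode of plain coordinate sampling is easy to reproduce. The following sketch (hypothetical helper names, standard library only) takes two points that differ along a single axis and counts how often a random choice of k out of d coordinates maps them to the same point. It illustrates only the problem; it does not implement the random rotation that fixes it.

```python
import math
import random


def sample_coordinates(x, coords):
    """Keep only the chosen coordinates of x."""
    return [x[i] for i in coords]


def dist(u, v):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def fraction_collapsed(d, k, trials, seed=0):
    """Fraction of random k-subsets of coordinates under which two points
    at distance 1, differing only along axis 0, become identical."""
    rng = random.Random(seed)
    u = [0.0] * d
    v = [0.0] * d
    v[0] = 1.0  # the whole distance lives on a single axis
    collapsed = 0
    for _ in range(trials):
        coords = rng.sample(range(d), k)
        if dist(sample_coordinates(u, coords), sample_coordinates(v, coords)) == 0.0:
            collapsed += 1
    return collapsed / trials


if __name__ == "__main__":
    # Axis 0 is missed with probability 1 - k/d, here 0.9.
    print(fraction_collapsed(d=100, k=10, trials=2000))
```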

Of course, applying a random rotation and then taking the first k coordinates
is equivalent to projecting the n points on a uniformly random k-dimensional
subspace. Indeed, this is exactly how the original proof of Lemma 1 by Johnson
and Lindenstrauss proceeds: to implement the embedding, multiply the $n \times d$
data matrix A with a random $d \times k$ orthonormal matrix. Dasgupta and Gupta [5]
and, independently, Indyk and Motwani [6] more recently gave a simpler proof of
Lemma 1 by taking the following more relaxed approach towards orthonormality.

The key idea is to consider what happens if we multiply A with a random $d \times k$
matrix R whose entries are independent Normal random variables with mean 0
and variance 1, i.e., N(0, 1). It turns out that while we do not explicitly enforce
either orthogonality or normality in R, its columns will come very close to having
both of these properties. This is because, as d increases: (i) the length of each
column-vector concentrates around its expectation as the sum of d independent
random variables; (ii) by the spherical symmetry of the Gaussian distribution,
each column-vector points in a uniformly random direction in $\mathbb{R}^d$, making the
k < d independent column-vectors nearly orthogonal with high probability.
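Both concentration effects can be checked numerically. The sketch below (hypothetical helper names, standard library only) draws a $d \times k$ matrix with N(0, 1) entries and reports how far the column lengths deviate from $\sqrt{d}$ in relative terms, and the largest absolute cosine between two distinct columns. It is a small experiment, not a proof.

```python
import math
import random


def gaussian_matrix(d, k, rng):
    """d x k matrix (list of rows) with independent N(0, 1) entries."""
    return [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(d)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def column_stats(d, k, seed=0):
    """Return (max relative deviation of column length from sqrt(d),
    max absolute cosine between two distinct columns)."""
    rng = random.Random(seed)
    R = gaussian_matrix(d, k, rng)
    cols = [[row[j] for row in R] for j in range(k)]
    norms = [math.sqrt(dot(c, c)) for c in cols]
    max_len_dev = max(abs(nrm / math.sqrt(d) - 1.0) for nrm in norms)
    max_cos = max(
        abs(dot(cols[i], cols[j])) / (norms[i] * norms[j])
        for i in range(k)
        for j in range(i + 1, k)
    )
    return max_len_dev, max_cos


if __name__ == "__main__":
    for d in (10, 100, 4000):
        len_dev, cos = column_stats(d, k=5)
        print(f"d={d:>5}: length deviation {len_dev:.3f}, max |cosine| {cos:.3f}")
```

Both quantities should shrink as d grows, so that $R/\sqrt{d}$ behaves more and more like a matrix with orthonormal columns.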

More generally, let R be a random $d \times k$ matrix whose entries $r_{ij}$ are
independent random variables with $\mathrm{E}(r_{ij}) = 0$ and $\mathrm{Var}(r_{ij}) = 1$.
If $f : \mathbb{R}^d \to \mathbb{R}^k$ is given by

$$f(x) = \frac{1}{\sqrt{k}}\, x R ,$$

it is easy to check that for any vector $x \in \mathbb{R}^d$ we have
$\mathrm{E}(\|f(x)\|^2) = \|x\|^2$. Effectively, the squared inner product of x with
each column of R acts as an independent estimate of $\|x\|^2$, making $\|f(x)\|^2$
the consensus estimate (the sum, scaled by $1/k$) of the k estimators. Seen from
this angle, requiring the k vectors to be orthonormal simply maximizes the mutual
information of the k estimators. For good dimensionality reduction, we also need
to minimize the variance of the estimators.
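
The map f can be sketched in a few lines. The following is a minimal illustration, assuming Gaussian $N(0, 1)$ entries for R; the function names, the sizes d and k, and the fixed seed are choices made for this example and are not taken from the text. It only checks empirically that $\|f(x)\|^2$ is close to $\|x\|^2$ for one vector.

```python
import math
import random

def random_matrix(d, k, rng):
    """d x k matrix with independent N(0, 1) entries (mean 0, variance 1)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(d)]

def embed(x, R):
    """f(x) = (1 / sqrt(k)) * x R, for a length-d row vector x."""
    d, k = len(R), len(R[0])
    scale = 1.0 / math.sqrt(k)
    return [scale * sum(x[i] * R[i][j] for i in range(d)) for j in range(k)]

def squared_norm(v):
    return sum(c * c for c in v)

rng = random.Random(0)      # fixed seed, for repeatability only
d, k = 1000, 400            # illustrative sizes, not from the text
R = random_matrix(d, k, rng)
x = [rng.uniform(-1.0, 1.0) for _ in range(d)]
ratio = squared_norm(embed(x, R)) / squared_norm(x)
print(f"||f(x)||^2 / ||x||^2 = {ratio:.3f}")
```

The ratio fluctuates around 1, and the fluctuation shrinks as k grows, which is the variance consideration mentioned above.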

In [1], it was shown that taking $r_{ij} = \pm 1$ with equal probability, in fact,
slightly reduces the number of required dimensions k (as the variance of each
column-estimator is slightly smaller). At the same time, and more importantly,
this choice of $r_{ij}$ makes f a lot easier to work with in practice.

3 Computing Low Rank Approximations

Given n points in $\mathbb{R}^d$ represented as an $n \times d$ matrix A, one of the most common
tasks in data analysis is to find the “top k” singular vectors of A and then project 
A onto the subspace they span. Such low rank approximations are used widely 
in areas such as computer vision, information retrieval, and machine learning to 
extract correlations and remove noise from matrix-structured data. 

Recall that the top singular vector of a matrix A is the maximizer of $\|Ax\|_2$
over all unit vectors x. This maximum is known as the $L_2$ norm of A and the
maximizer captures the dominant linear trend in A. Remarkably, this maximizer
can be discovered by starting with a random unit vector $x \in \mathbb{R}^d$ and repeating the
following “voting process” until it reaches a fixpoint, i.e., until x stops rotating:

— Have each of the n rows in A vote on candidate x, i.e., compute $y = Ax \in \mathbb{R}^n$.

— Compose a new candidate by combining the rows of A, weighing each row
by its enthusiasm for x, i.e., update $x \leftarrow A^T y / \|A^T y\| \in \mathbb{R}^d$.
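
The voting process is the classical power iteration. A minimal sketch follows, assuming a nonzero matrix stored as a list of rows and a fixed number of iterations in place of an explicit fixpoint test; the function names and the small example matrix are illustrative only.

```python
import math
import random

def normalize(v):
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def top_singular_vector(A, iterations=200, seed=0):
    """Approximate the top right singular vector of A (n rows of length d)."""
    rng = random.Random(seed)
    n, d = len(A), len(A[0])
    x = normalize([rng.gauss(0.0, 1.0) for _ in range(d)])   # random unit start
    for _ in range(iterations):
        # Each row votes on the candidate: y = A x.
        y = [sum(A[i][j] * x[j] for j in range(d)) for i in range(n)]
        # Combine the rows, weighted by their votes: A^T y, then rescale.
        z = [sum(A[i][j] * y[i] for i in range(n)) for j in range(d)]
        x = normalize(z)
    return x

A = [[3.0, 0.0],
     [0.0, 1.0]]
x = top_singular_vector(A)   # converges to (+/-1, 0), the direction of the larger singular value
```

The sign of the result depends on the random start, so only the direction is meaningful.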

The above idea extends to $k > 1$. To find the k-dimensional invariant subspace
of A, one starts with a random subspace, i.e., a random $d \times k$ orthonormal
matrix, and repeatedly multiplies by $A^T A$ (orthonormalizing after each
multiplication). Computing the singular row-vectors of A, i.e., the eigenvectors of
$B = A^T A$, is often referred to as Principal Component Analysis (PCA). The
following process achieves the exact same goal, by extracting the dominant trends in
A sequentially, in order of strength: let $A_0$ be the all zeros matrix; for $i = 1, \ldots, k$:

— Find the top singular vector, $x_i$, of $A - A_{i-1}$, via the voting process above.

— Let $A_i = A_{i-1} + A x_i x_i^T$, i.e., $A_i$ is the optimal rank i approximation to A.



To get an idea of how low rank approximations can remove noise, let G be
an $n \times d$ random matrix whose entries are i.i.d. $N(0, \sigma^2)$ random variables. We
saw earlier that each column of G points in an independent, uniformly random
direction in $\mathbb{R}^n$. As a result, when n is large, with high probability the $d < n$
columns of G are nearly orthogonal and there is no low-dimensional subspace
that simultaneously accommodates many of them. This means that when we
compute a low rank approximation of $A + G$, as long as $\sigma$ is “not too large” (in
a sense we will make precise), the columns of G will exert little influence as they
do not strongly favor any particular low-dimensional subspace. Assuming that
A contains strong linear trends, it is its columns that will command and receive
accommodation.

To make this intuition more precise, we first state a general bound on the
impact that a matrix N can have on the optimal rank k approximation of a
matrix A, denoted by $A_k$, as a function of $\|N_k\|$. Recall that $\|A\|_F^2 = \sum_{i,j} A_{ij}^2$.

Lemma 2. For any matrices A and N, if $\hat{A} = A + N$ then

$$\|A - \hat{A}_k\|_2 \le \|A - A_k\|_2 + 2\|N_k\|_2 \quad \text{and}$$

$$\|A - \hat{A}_k\|_F \le \|A - A_k\|_F + \|N_k\|_F + 2\sqrt{\|N_k\|_F \, \|A_k\|_F} .$$

Notice that all error terms above scale with $\|N_k\|$. As a result, whenever N is
poorly approximated in k dimensions, i.e., $\|N_k\|$ is small, the error caused by
adding N to a matrix A is also small.

Let us consider the norms of our Gaussian perturbation matrix. 

Fact 1 Let G be a random $n \times d$ matrix, where $d < n$, whose entries are i.i.d.
random variables $N(0, \sigma^2)$. For any $\epsilon > 0$, with probability $1 - 1/\mathrm{poly}(n, \epsilon)$,

$$\|G\|_2 = \|G_k\|_2 \le (2 + \epsilon)\,\sigma\sqrt{n} \quad \text{and} \quad \|G_k\|_F \le (2 + \epsilon)\,\sigma\sqrt{kn} .$$

Remarkably, the upper bound above for $\|G\|_2$ is within a factor of 2 of the lower
bound $\sigma\sqrt{n}$ on the $L_2$ norm of any $n \times d$ matrix with mean squared entry $\sigma^2$.
In other words, a random Gaussian matrix is nearly as unstructured as possible,
resembling white noise in the flatness of its spectrum. On the other hand, $\|A\|_2$
can be as large as $\sigma\sqrt{dn}$ for an $n \times d$ matrix A with mean squared entry $\sigma^2$.
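
Fact 1 can be checked numerically on a small instance. The sketch below is only an illustration under stated assumptions: the sizes n and d, the seed, and the use of power iteration to estimate the $L_2$ norm are choices made here, and power iteration yields a slight underestimate of the true norm.

```python
import math
import random

def spectral_norm(M, iterations=60, seed=1):
    """Estimate ||M||_2 by power iteration on M^T M (a lower estimate)."""
    rng = random.Random(seed)
    n, d = len(M), len(M[0])
    x = [rng.gauss(0.0, 1.0) for _ in range(d)]
    for _ in range(iterations):
        y = [sum(M[i][j] * x[j] for j in range(d)) for i in range(n)]
        z = [sum(M[i][j] * y[i] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(c * c for c in z))
        x = [c / norm for c in z]
    y = [sum(M[i][j] * x[j] for j in range(d)) for i in range(n)]
    return math.sqrt(sum(c * c for c in y))

rng = random.Random(0)
n, d, sigma = 200, 40, 1.0      # illustrative sizes with d < n
G = [[rng.gauss(0.0, sigma) for _ in range(d)] for _ in range(n)]
estimate = spectral_norm(G)
lower = sigma * math.sqrt(n)            # lower bound for any such matrix
upper = 2.0 * sigma * math.sqrt(n)      # upper bound of Fact 1 with epsilon -> 0
print(f"{lower:.1f} <= {estimate:.1f} <= {upper:.1f}")
```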

This capacity of spectral techniques to remove Gaussian noise is by now very
well-understood. We will see that the above geometric explanation of this fact
can actually accommodate much more general noise models, e.g., $N_{ij}$ that are
not identically distributed and, in fact, whose distribution depends on $A_{ij}$. In
the next section, this generality will enable the notion of “computation-friendly
noise”, i.e., noise that enhances (rather than hinders) spectral computations.

Fact 1 also suggests a criterion for choosing a good value of k when seeking
low rank approximations of an $n \times d$ data matrix A:

$$\|A - A_k\|_2 \approx \sigma\sqrt{n}, \quad \text{where } \sigma^2 \text{ is the mean squared entry in } A - A_k .$$

In words: we should stop when, after projecting onto the top k singular vectors,
we are left with a matrix, $A - A_k$, whose strongest linear trend is comparable to
that of a random matrix of similar scale.

3.1 Co-opting the Noise Process 

Computing optimal low rank approximations of large matrices often runs against 
practical computational limits since the algorithms for this task generally require 
superlinear time and a large working set. On the other hand, in many applica- 
tions it is perfectly acceptable just to find a rank k matrix C satisfying 

$$\|A - C\| \le \|A - A_k\| + \delta ,$$

where $A_k$ is the optimal rank k approximation of the input matrix A, and $\delta$
captures an appropriate notion of “error tolerance” for the domain at hand.
In [2], it was shown that with the aid of randomization one can exploit such an
“error allotment” to aid spectral computations. The main idea is as follows.

Imagine, first, that we squander the error allotment by obliviously adding to 
A a Gaussian matrix G, as in the previous section. While this is not likely to 
yield a computationally advantageous matrix, we saw that at least it is rather 
harmless. The first step in using noise to aid computation is realizing that G is 
innocuous due precisely to the following three properties of its entries: 

independence, zero mean, small variance. 

The fact that the $G_{ij}$ are Gaussian is not essential: a fundamental result of
Füredi and Komlós [4] shows that Fact 1 generalizes to random matrices where
the entries can have different, in fact arbitrary, distributions as long as all $N_{ij}$
are zero-mean, independent, and their variance is bounded by $\sigma^2$.

To exploit this fact for computational gain, given a matrix A, we will create
a distribution of noise matrices N that depends on A, yet is such that the random
variables $N_{ij}$ still enjoy independence, zero mean, and small variance. In
particular, we will be able to choose N so that $\hat{A} = A + N$ has computationally
useful properties, such as sparsity, yet N is sufficiently random for $\|N_k\|$ to be
small with high probability.

Example: Set $N_{ij} = \pm A_{ij}$ with equal probability, independently for all i, j.

In this example, the random variables $N_{ij}$ are independent, $\mathrm{E}[N_{ij}] = 0$ for
all i, j, and the standard deviation of $N_{ij}$ equals $|A_{ij}|$. On the other hand, the
matrix $\hat{A} = A + N$ will have about half as many non-zero entries as A, i.e., it
will be about twice as sparse. Therefore, while $\|A\|_2$ can be proportional to $\sqrt{dn}$,
the error term $\|N\|_2$, i.e., the price for the sparsification, is only proportional to
$\sqrt{n}$.

The rather innocent example above can be greatly generalized. To simplify
exposition, in the following, we assume that $A_{ij} \in [-1, +1]$.

— Quantization: For all i, j, independently, set $\hat{A}_{ij}$ to +1 with probability
$(1 + A_{ij})/2$, and to −1 with probability $(1 - A_{ij})/2$. Clearly, for all i, j, we
have $\mathrm{E}[N_{ij}] = \mathrm{E}[\hat{A}_{ij} - A_{ij}] = 0$, while $\mathrm{Var}(N_{ij}) = \mathrm{E}[N_{ij}^2] \le 4$.

— Uniform sampling: For any desired fraction $p \in (0, 1]$, set $\hat{A}_{ij} = A_{ij}/p$
with probability p, and 0 otherwise. Now, $\mathrm{Var}(N_{ij}) = A_{ij}^2 (1 - p)/p \le 1/p$,
so that the error grows only as $1/\sqrt{p}$ as we retain a p-fraction of all entries.

— Weighted sampling: For all i, j, independently, set $\hat{A}_{ij} = A_{ij}/p_{ij}$ with
probability $p_{ij}$, and 0 otherwise, where $p_{ij} = p A_{ij}^2$. This way we retain even
fewer small entries, while maintaining $\mathrm{Var}(N_{ij}) = 1/p - A_{ij}^2 \le 1/p$.
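
The three schemes above act on each entry independently, so they can be sketched as per-entry functions. This is a minimal illustration, assuming entries in $[-1, +1]$; the function names, the test value a, the parameter p, and the number of trials are chosen for this example. The empirical means illustrate that each scheme is unbiased, i.e., that the noise has zero mean.

```python
import random

def quantize(a, rng):
    """Return +1 with probability (1 + a)/2 and -1 otherwise; assumes -1 <= a <= 1."""
    return 1.0 if rng.random() < (1.0 + a) / 2.0 else -1.0

def uniform_sample(a, p, rng):
    """Keep the entry, rescaled to a/p, with probability p; otherwise return 0."""
    return a / p if rng.random() < p else 0.0

def weighted_sample(a, p, rng):
    """Keep the entry with probability p_ij = p * a^2, rescaled to a / p_ij."""
    p_ij = p * a * a
    if p_ij == 0.0:
        return 0.0          # a zero entry stays zero
    return a / p_ij if rng.random() < p_ij else 0.0

def empirical_mean(draw, trials=100_000):
    return sum(draw() for _ in range(trials)) / trials

rng = random.Random(0)
a, p = 0.5, 0.5             # illustrative entry value and sampling parameter
means = {
    "quantize": empirical_mean(lambda: quantize(a, rng)),
    "uniform": empirical_mean(lambda: uniform_sample(a, p, rng)),
    "weighted": empirical_mean(lambda: weighted_sample(a, p, rng)),
}
print(means)                # each value should be close to a
```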

Reducing the number of non-zero entries and their representation length
causes standard eigenvalue algorithms to work faster. Moreover, the reduced
memory footprint of the matrix $\hat{A}$ enables the handling of larger data sets. At a
high level, we perform data reduction by randomly perturbing each data vector
so as to simplify its representation, i.e., sparsify and quantize. The point is that
the perturbation vectors we use, by virtue of their independence, do not fit in a
small subspace, acting effectively as “white noise” that is largely filtered out.

4 Kernel Principal Component Analysis 

Given a collection X of training data $x_1, \ldots, x_n \in \mathbb{R}^d$, techniques such as linear
SVMs and PCA extract features from X by computing linear functions of X.
However, often the structure present in the training data is not a linear function
of the data representation. Worse, many data sets do not readily support linear
operations such as addition and scalar multiplication (text, for example).

In a “kernel method” the idea is to map X into a space $\mathcal{H}$ equipped with
inner product. The dimension of $\mathcal{H}$ can be very large, even infinite, and therefore
it may not be practical (or possible) to work with the mapped data explicitly
by applying $\Phi : X \to \mathcal{H}$. Nevertheless, in many interesting cases it is possible to
efficiently evaluate the dot products $\langle \Phi(x_i), \Phi(x_j) \rangle$ via a positive definite kernel
k for $\Phi$, i.e., a function k so that $k(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle$. Algorithms whose
operations can be expressed in terms of inner products can thus operate on $\Phi(X)$
implicitly, given only the Gram matrix

$$K_{ij} := k(x_i, x_j) .$$

Given n training data points, the Kernel PCA (KPCA) method [8] begins
by forming the Gram matrix K above and computing the $\ell$ largest eigenvalues,
$\lambda_1, \ldots, \lambda_\ell$, and corresponding eigenvectors, $e_1, \ldots, e_\ell$, of K for some appropriate
choice of $\ell < n$. Then, given an input point x, the method computes the value of
the $\ell$ nonlinear feature extractors, corresponding to the inner product of the vector
$k(x) = (k(x, x_1), k(x, x_2), \ldots, k(x, x_n))$ with each of the eigenvectors. These
feature-values can be used for clustering, classification etc.

While Kernel PCA is very powerful, the matrix K, in general, is dense, making
the input size scale as $n^2$, where n is the number of training points. As
kernel functions become increasingly sophisticated, e.g., invoking dynamic
programming to evaluate the similarity $k(x_i, x_j)$ of two strings $x_i, x_j$, just the
cost of $O(n^2)$ kernel evaluations to construct K rapidly becomes prohibitive.

The uniform sparsification and quantization techniques of the previous section
are ideally suited for speeding up KPCA. In particular, “sparsification” here
means that we actually only construct a matrix $\hat{K}$ by computing $k(x_i, x_j)$ for a
uniformly random subset of all input pairs $x_i, x_j$ and filling in 0 for the remaining
pairs. In [3], it was proven that as long as K has strong linear structure (which
is what justifies KPCA in the first place), with high probability, the invariant
subspaces of $\hat{K}$ will be very close to those of K.
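
A sparsified Gram matrix of this kind might be built as in the sketch below. Several details are assumptions of this example rather than statements from the text: the RBF kernel and its parameter, the rescaling of retained entries by 1/p (borrowed from the uniform sampling scheme of Section 3.1), and the choice to sample unordered pairs so that the result stays symmetric.

```python
import math
import random

def rbf_kernel(x, y, gamma=0.5):
    """Radial basis function kernel, used here only as an example of k."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def sparsified_gram(points, kernel, p, rng):
    """Evaluate the kernel on a random p-fraction of pairs; fill in 0 elsewhere."""
    n = len(points)
    K_hat = [[0.0] * n for _ in range(n)]
    evaluations = 0
    for i in range(n):
        for j in range(i, n):               # each unordered pair is considered once
            if rng.random() < p:
                value = kernel(points[i], points[j]) / p    # assumed 1/p rescaling
                K_hat[i][j] = K_hat[j][i] = value
                evaluations += 1
    return K_hat, evaluations

rng = random.Random(0)
points = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(30)]
K_hat, evaluations = sparsified_gram(points, rbf_kernel, p=0.25, rng=rng)
print(f"{evaluations} kernel evaluations instead of {30 * 31 // 2}")
```

The saving comes from the skipped kernel evaluations, which matters most when a single evaluation is expensive.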

Also, akin to quantization, we can replace each exact evaluation of $k(x_i, x_j)$
with a more easily computable unbiased estimate for it. In [3], it was shown
that for kernels where: i) $X \subseteq \mathbb{R}^d$, and, ii) $k(x_i, x_j)$ depends only on $\|x_i - x_j\|$
and/or $x_i \cdot x_j$, one can use random projections, as described in Section 2, for
this purpose. Note that this covers some of the most popular kernels, e.g., radial
basis functions (RBF) and polynomial kernels.

5 Future Work

Geometric and spectral properties of random matrices with zero-mean, indepen- 
dent entries are the key ingredients in all three examples we considered [1-3]. 
More general ensembles of random matrices hold great promise for algorithm 
design and call for a random matrix theory motivated from a computational per- 
spective. Two natural directions are the investigation of matrices with limited 
independence, and the development of concentration inequalities for non-linear 
functionals of random matrices. 

We saw that sampling and quantizing matrices can be viewed as injecting 
“noise” into them to endow useful properties such as sparsity and succinctness. 
The distinguishing feature of this viewpoint is that the effect of randomization is 
established without an explicit analysis of the interaction between randomness 
and computation. Instead, matrix norms act as an interface between the two 
domains: (i) matrix perturbation theory asserts that matrices of small spectral 
norm cannot have a large effect in eigencomputations, while (ii) random matrix 
theory asserts that matrices of zero-mean, independent random variables with 
small variance have small spectral norm. Is it possible to extend this style of
analysis to other machine-learning settings, e.g. Support Vector Machines?

There is increasing need to build information systems that protect the privacy
and ownership of data without impeding the flow of information. We will present 
some of our current work to demonstrate the technical feasibility of building such 
systems: 

Privacy-preserving data mining. The conventional wisdom held that data min- 
ing and privacy were adversaries, and the use of data mining must be restricted 
to protect privacy. Privacy-preserving data mining cleverly finesses this conflict
by exploiting the difference between the level where we care about privacy, i.e.,
individual data, and the level where we run data mining algorithms, i.e.,
aggregated data. User data is randomized such that it is impossible to recover
anything meaningful at the individual level, while still allowing the data mining 
algorithms to recover aggregate information, build mining models, and provide 
actionable insights. 

Hippocratic databases. Unlike the current systems, Hippocratic databases include 
responsibility for the privacy of data they manage as a founding tenet. Their 
core capabilities have been distilled from the principles behind current privacy 
legislations and guidelines. We identify the technical challenges and problems in 
designing Hippocratic databases, and also outline some solutions. 

Sovereign information sharing. Current information integration approaches are 
based on the assumption that the data in each database can be revealed com- 
pletely to the other databases. Trends such as end-to-end integration, outsourc- 
ing, and security are creating the need for integrating information across au- 
tonomous entities. In such cases, the enterprises do not wish to completely re- 
veal their data. In fact, they would like to reveal minimal information apart 
from the answer to the query. We have formalized the problem, identified key 
operations, and designed algorithms for these operations, thereby enabling a new 
class of applications, including information exchange between security agencies, 
intellectual property licensing, crime prevention, and medical research. 



References 

1. R. Agrawal, R. Srikant: Privacy Preserving Data Mining. ACM Int’l Conf. on Man- 
agement of Data (SIGMOD), Dallas, Texas, May 2000. 

2. R. Agrawal, J. Kiernan, R. Srikant, Y. Xu: Hippocratic Databases. 28th Int’l Conf. 
on Very Large Data Bases (VLDB), Hong Kong, August 2002. 

3. R. Agrawal, A. Evfimievski, R. Srikant: Information Sharing Across Private
Databases. ACM Int’l Conf. on Management of Data (SIGMOD), San Diego,
California, June 2003.



J.-F. Boulicaut et al. (Eds.): PKDD 2004, LNAI 3202, p. 8, 2004. 
(c) Springer-Verlag Berlin Heidelberg 2004




Breaking Through the Syntax Barrier: 
Searching with Entities and Relations 



Soumen Chakrabarti 

IIT Bombay 
soumen@cse.iitb.ac.in



Abstract. The next wave in search technology will be driven by the 
identification, extraction, and exploitation of real-world entities repre- 
sented in unstructured textual sources. Search systems will either let 
users express information needs naturally and analyze them more intel- 
ligently, or allow simple enhancements that add more user control on 
the search process. The data model will exploit graph structure where 
available, but not impose structure by fiat. First generation Web search, 
which uses graph information at the macroscopic level of inter-page hy- 
perlinks, will be enhanced to use fine-grained graph models involving 
page regions, tables, sentences, phrases, and real-world-entities. New al- 
gorithms will combine probabilistic evidence from diverse features to 
produce responses that are not URLs or pages, but entities and their 
relationships, or explanations of how multiple entities are related. 



1 Toward More Expressive Search 

Search systems for unstructured textual data have improved enormously since 
the days of boolean queries over title and abstract catalogs in libraries. Web 
search engines index much of the full text from billions of Web pages and serve 
hundreds of millions of users per day. They use rich features extracted from the 
graph structure and markups in hypertext corpora. 

Despite these advances, even the most popular search engines make us feel 
that we are searching with mere strings: we do not find direct expression of 
the entities involved in our information need, let alone relations that must
hold between those entities in a proper response. In a plenary talk at the 2004 
World-wide Web Conference, Udi Manber commented: 

If music had been invented ten years ago along with the Web, we would 
all be playing one-string instruments (and not making great music). 

referring to the one-line text boxes in which users type in 1-2 keywords and 
expect perfect gratification with the responses. 

Apart from classical Information Retrieval (IR), several communities are 
coming together in the quest of expressive search, but they are coming from 
very different origins. 

Databases and XML: To be sure, the large gap between the user’s information
need and the expressed query is well-known. The database community has been
traditionally uncomfortable with the imprecise nature of queries inherent in IR.
The preference for precise semantics has persisted from SQL to XQuery (the
query language proposed for XML data). The rigor, while useful for
system-building, has little appeal for the end-user, who will not type SQL, let alone
XQuery.

Two communities are situated somewhere between “uninterpreted” keyword 
search systems and the rigor of database query engines. Various sub-communities 
of natural language processing (NLP) researchers are concerned with NL inter- 
faces to query systems. The other community, which has broad overlaps with 
the NLP community, deals with information extraction (IE). 

NLP: Classical NLP is concerned with annotating grammatical natural
language with parts of speech (POS), chunking phrases and clauses, disambiguating
polysemous words, extracting a syntactic parse, resolving pronoun and other
references, analyzing roles (eating with a spoon vs. with a friend), preparing a complete
computer-usable representation of the knowledge embedded in the original text,
and performing automatic inference with this knowledge representation. Outside
controlled domains, most of these, especially the latter ones, are very ambitious
goals. Over the last decade, NLP research has gradually moved toward building
robust tools for the simpler tasks [19].

IE: Relatively simple NLP tasks, such as POS tagging, named entity tagging, 
and word sense disambiguation (WSD) share many techniques from machine 
learning and data mining. Many such tasks model unstructured text as a se- 
quence of tokens generated from a finite state machine, and solve the reverse 
problem: given the output token sequence, estimate the state sequence. E.g., if 
we are interested in extracting dates from text, we can have a positive and a 
negative state, and identify the text spans generated by the positive state. IE is 
commonly set up as a supervised learning problem, which requires training text 
with labeled spans. 
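
The date example can be sketched with a two-state model and Viterbi decoding, which recovers the most likely state sequence for a given token sequence. Everything numeric below is an assumption made for illustration: the probabilities are set by hand rather than learned from labeled spans, and the three coarse symbols (word, number, month) stand in for real token features.

```python
import math

STATES = ("OTHER", "DATE")      # negative and positive state
START = {"OTHER": 0.8, "DATE": 0.2}
TRANS = {"OTHER": {"OTHER": 0.8, "DATE": 0.2},
         "DATE": {"OTHER": 0.4, "DATE": 0.6}}
EMIT = {"OTHER": {"word": 0.90, "number": 0.05, "month": 0.05},
        "DATE": {"word": 0.05, "number": 0.50, "month": 0.45}}
MONTHS = {"january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"}

def symbol(token):
    """Map a token to one of three coarse output symbols."""
    t = token.strip(",.").lower()
    if t in MONTHS:
        return "month"
    return "number" if t.isdigit() else "word"

def viterbi(tokens):
    """Most likely state sequence under the hand-set model above."""
    if not tokens:
        return []
    obs = [symbol(t) for t in tokens]
    score = {s: math.log(START[s]) + math.log(EMIT[s][obs[0]]) for s in STATES}
    back = []
    for o in obs[1:]:
        prev, score, pointers = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda r: prev[r] + math.log(TRANS[r][s]))
            score[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][o])
            pointers[s] = best
        back.append(pointers)
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):     # follow back-pointers to the start
        state = pointers[state]
        path.append(state)
    return path[::-1]

tokens = "The meeting is on 20 September 2004 in Pisa".split()
labels = viterbi(tokens)
date_span = [t for t, s in zip(tokens, labels) if s == "DATE"]
```

The tokens labeled DATE form the extracted span; in a supervised setting the three probability tables would be estimated from the labeled training text.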

Obviously, to improve the search experience, we need the following:

— Users express their information need in some more detail, while minimizing
additional cognitive burden.

— The system makes intelligent use of said detail, thus rewarding the burden
the user agrees to undertake.

This new contract will work only if social engineering and technological
advances work efficiently in concert.

2 The New Contract: Query Syntax 

Suitable user interfaces, social engineering, and reward must urge the user to 
express their information need in some more detail. Relevance feedback, offering 
query refinements, and encouraging the user to drill down into response clusters 
are some ways in which systems collect additional information about the user’s 
information need. But there are many situations where direct input from the 
user can be useful. I will discuss two kinds of query augmentation. 

Fragments of types: If the token 2000 appears in grammatical text, current
technology can usually disambiguate between the year and some other
number, say a money amount. There is no reason why search interfaces cannot
accept a query with a type hint so as to avoid spurious matches. There is
also no reason a user cannot look for persons related to SVMs using the query
PersonType NEAR "SVM", where PersonType is the anticipated response type
and SVM a word to match. To look for a book on SVMs published around year
2000, one might type BookType (NEAR "SVM" year~2000). I believe that the
person composing the query, being the stakeholder in response quality, can be
encouraged to provide such elementary additional information, provided the
reward is quickly tangible. Moreover, reasonably deep processing power can be
spent on the query, and this may even be delegated to the client computer.

Attributes, roles and relations: Beyond annotating query tokens with type 
information, the user may want to express that they are looking for “a business 
that repairs iMacs,” “the transfer bandwidth of USB2.0,” and “papers written 
in 1985 by C. Mohan.” It should be possible to express broad relations between 
entities in the query, possibly the placeholder entity that must be instantiated 
into the answer. The user may constrain the placeholder entity using attributes 
(e.g. MacOS-compliant software), roles and relations (e.g., a student advised by 
X). The challenge will be to support an ever-widening set of attribute types,
roles and relations while ensuring ongoing isolation and compatibility between 
knowledge bases, features, and algorithms. 

Compared to query syntax and preprocessing, whose success depends largely 
on human factors, we have more to say about executing the internal form of the 
query on a preprocessed corpus. 

3 The New Contract: Corpus and Query Processing 

While modest changes may be possible in users’ query behavior, there is far too
much inertia to expect content creators to actively assist mediation in the
immediate future. Besides, question preprocessing can be distributed economically,
but corpus processing usually cannot.

The situation calls for relatively light processing of the corpus, at least until 
query time. During large scale use, however, a sizable fraction of the corpus may 
undergo complex processing. It would be desirable but possibly challenging to 
cache the intermediate results in a way that can be reused efficiently. 

3.1 Supervised Entity Extraction 

Information extraction (IE), also called named entity tagging, annotates spans of 
unstructured text with markers for instances of specified types, such as people, 
organizations, places, dates, and quantities. 

A popular framework [11] models the text as a linear sequence of tokens 
being generated from a Markov state machine. A parametric model for state 
transition and symbol emission is learned from labeled training data. Then the 
model is evaluated on test data, and spans of tokens likely to be generated by 
desired states are picked off as extracted entities. 

Generative models such as hidden Markov models (HMMs) have been used
for IE for a while [7]. If s is the (unknown) sequence of states and x the sequence
of output features, HMMs seek to optimize the joint likelihood Pr(s,x).

In general, x is a sequence of feature vectors. Apart from the tokens
themselves, some derived features found beneficial in IE are of the form: Does the
token

— Contain a digit, or digits and commas? 

— Contain patterns like DD:DD or DDDD or DD’s where D is a digit? 

— Follow a preposition? 

— Look like a proper noun (as flagged by a part-of-speech tagger¹)?

— Start with an uppercase letter? 

— Start with an uppercase letter and continue with lowercase letters? 

— Look like an abbreviation (e.g., uppercase letters alternating with periods)? 
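
Most of the token features listed above are simple pattern tests. The sketch below is one possible rendering, with assumptions stated plainly: the feature names and the short preposition list are invented for this example, and the proper-noun feature is omitted because it requires a part-of-speech tagger.

```python
import re

# Hypothetical, deliberately short list; a real system would use a fuller one.
PREPOSITIONS = {"in", "on", "at", "by", "of", "to", "from", "with"}

def token_features(token, prev_token=None):
    """Binary features of a token, given the preceding token if there is one."""
    return {
        "has_digit": any(c.isdigit() for c in token),
        "digits_and_commas": re.fullmatch(r"\d[\d,]*", token) is not None,
        "pattern_DD:DD": re.fullmatch(r"\d\d:\d\d", token) is not None,
        "pattern_DDDD": re.fullmatch(r"\d{4}", token) is not None,
        "pattern_DD's": re.fullmatch(r"\d\d's", token) is not None,
        "follows_preposition": (prev_token is not None
                                and prev_token.lower() in PREPOSITIONS),
        "starts_upper": token[:1].isupper(),
        "capitalized": re.fullmatch(r"[A-Z][a-z]+", token) is not None,
        "abbreviation": re.fullmatch(r"(?:[A-Z]\.)+", token) is not None,
    }

features = token_features("10:30", prev_token="at")
```

A feature vector of this kind, computed for every token, forms the sequence x discussed here.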

The large dimensionality of the feature vectors usually corners us into naive
independence assumptions about Pr(s,x), and the large redundancy across features
then leads to poor estimates of the joint distribution.

Recent advances in modeling conditional distributions [18] directly optimize 
Pr(s|x), allowing the use of many redundant features without attempting to 
model the distribution over x itself. 



3.2 Linkage Analysis and Alias Resolution 

After the IE step, spans of characters and tokens are marked with type iden- 
tifiers. However, many string spans (called aliases ) may refer to a single en- 
tity (e.g., IBM, International Business Machines , Big Blue, the computer giant 
or www.ibm.com). The variations may be based on abbreviations, pronouns, 
anaphora, hyperlinks and other creative ways to create shared references to en- 
tities. Some of these aliases are syntactically similar to each other but others are 
not. 

In general, detecting aliases from unstructured text, also called coreference
resolution, in a complete and correct manner is considered “NLP complete,”
i.e., requires deep language understanding and vast amounts of world knowl- 
edge. Alias resolution is an active and difficult area of NLP research. In the IE 
community, more tangible success has been achieved within the relatively limited 
scope of record linkage. 

In record linkage, the first IE step results in structured tables of entities, 
each having attributes and relations to other entities. E.g., we may apply IE 
techniques to bibliographies at the end of research papers to populate a table of 
papers, authors, conferences/journals, etc. Multiple rows in each table may refer 
to the same object. Similar problems may arise in Web search involving names 
of people, products, and organizations. 

The goal of record linkage is to partition rows in each table into equivalence 
classes, all rows in a class being references to one real-world entity. Obviously, 
knowing that two different rows in the author table refer to the same person 
(e.g., one may abbreviate the first name) may help us infer that two rows in the 
paper table refer to the same real-world paper. 
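
Partitioning rows into equivalence classes can be sketched with a union-find structure driven by a pairwise match predicate. The predicate used here (same last name and same first initial) is a deliberately crude, hypothetical rule, and the author names are invented; practical record linkage relies on learned similarity measures.

```python
def name_key(name):
    """Hypothetical normalization: (first initial, last name), lowercased."""
    parts = name.replace(".", " ").lower().split()
    return parts[0][0], parts[-1]

def same_author(a, b):
    return name_key(a) == name_key(b)

def link_records(rows, match):
    """Partition rows into classes of rows that (transitively) match."""
    parent = list(range(len(rows)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            if match(rows[i], rows[j]):
                parent[find(i)] = find(j)   # merge the two classes
    classes = {}
    for i, row in enumerate(rows):
        classes.setdefault(find(i), []).append(row)
    return list(classes.values())

authors = ["Jane Smith", "J. Smith", "Alan Turing", "A. Turing", "A. Church"]
classes = link_records(authors, same_author)
```

The quadratic comparison loop is acceptable only for small tables; it is kept here for clarity.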

A variety of new techniques are being brought to bear on record linkage [10]
and coreference resolution [20], and this is an exciting area of current research.

¹ Many modern part-of-speech taggers are in turn driven by state transition models.

3.3 Bootstrapping Ontologies from the Web

The set of entity types of interest to a search system keeps growing and changing. 
A fixed set of types and entities may not keep up. The system may need to 
actively explore the corpus to propose new types and extract entities for old and 
new types. Eventually, we would like the system to learn how to learn. 

Suppose we want to discover instances of some type of entity (city, say) on 
the Web. We can exploit the massive redundancy of the Web and use some very 
simple patterns [16,8,1,13]: 

“cities” {“,”} “such as” NPList2 
NP1 {“,”} “and other cities” 

“cities” {“,”} “including” NPList2 
NP1 “is a city” 

Here { } denotes an optional pattern and NP is a noun phrase. These patterns 
are fired off as queries to a number of search engines. A set of rules test the 
response Web pages for the existence of valid instantiations of the patterns. A 
rule may look like this: 

NP1 “such as” NPList2 AND
head(NP1) = “cities” AND
properNoun(head(each(NPList2)))
⇒ instanceOf(City, head(each(NPList2)))
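
A rule of this shape can be approximated with regular expressions. The sketch below is a rough stand-in, not the cited systems' implementation: it replaces noun-phrase chunking and proper-noun tagging with a capitalization test, handles only the “such as” pattern, and the function name and sample sentence are invented.

```python
import re

PROPER_NOUN = re.compile(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*")

def extract_cities(text):
    """Return candidate city names found after 'cities such as'."""
    found = []
    for match in re.finditer(r"\bcities,? such as ([^.;]+)", text):
        # Split the list "X, Y and Z" into its items.
        for item in re.split(r",\s*(?:and\s+)?|\s+and\s+", match.group(1)):
            item = item.strip()
            if PROPER_NOUN.fullmatch(item):     # crude proper-noun test
                found.append(item)
    return found

cities = extract_cities("We toured cities such as Pisa, Florence and New York. It rained.")
# cities == ["Pisa", "Florence", "New York"]
```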

KnowItAll [13] makes a probabilistic assessment of the quality of the
extraction by collecting co-occurrence statistics on the Web of terms carefully chosen
from the extracted candidates and pre-defined discriminator phrases. E.g., if X
is a candidate actor, “X starred in” or “starring X” would be good discriminator
phrases. KnowItAll uses the point-wise mutual information (PMI) formulation
by Turney [24] to measure the association between the candidate instance I and
the discriminator phrase D: $\mathrm{PMI}(I, D) = |\mathrm{Hits}(D + I)| / |\mathrm{Hits}(I)|$.
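
As a small worked instance of this ratio, the sketch below scores two candidates. The hit counts are invented for illustration; in practice they would come from search engine queries, and the zero-hit convention is an assumption of this example.

```python
def pmi(hits_with_discriminator, hits_instance):
    """PMI(I, D) = |Hits(D + I)| / |Hits(I)|; defined as 0 when I has no hits."""
    if hits_instance == 0:
        return 0.0
    return hits_with_discriminator / hits_instance

# Invented counts: (hits for "starring X", hits for X alone).
counts = {"candidate A": (300, 10_000), "candidate B": (2, 10_000)}
scores = {name: pmi(joint, alone) for name, (joint, alone) in counts.items()}
```

A candidate that co-occurs often with the discriminator phrase, relative to its overall frequency, receives the higher score.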

Apart from finding instances of types, it is possible to discover subtypes. E.g., 
if we wish to find instances of scientists, and we have a seed set of instances, we 
can discover that physicists and biologists are scientists, make up new patterns 
from the old ones (e.g. “scientist X” to “physicist X”) and improve our harvest 
of new instances. 

In Sections 3.5 and 3.6 we will see how automatic extraction of ontologies 
can assist next-generation search. 

3.4 Searching Relational Data with NL Queries 

In this section and the next (§3.5), we will assume that information extraction 
and alias analysis have led to a reasonably clean entity-relationship (ER) graph. 
The graphs formed by nodes corresponding to authors, papers, conferences and 
journals in DBLP, and actors/actresses, movies, awards, genres, ratings,
producers and music directors in the Internet Movie Database (IMDB) are examples
of reasonably clean entity-relationship data graphs. Other real-life examples in- 
volve e-commerce product catalogs and personal information management data, 
with organizations, people, locations, emails, papers, projects, seminars, etc. 

There is a long history of systems that give a natural language interface 
(NLI) to relational engines [4], but, as in general NLP research, recent work has 
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moved from highly engineered solutions to arbitrarily complex problems to less 
knowledge-intensive and more robust solutions for limited domains [21]. E.g., for 
a table JQB(description, platform, company) and the NL query What are the 
HP jobs on a UNIX system?, the translation to SQL might be select distinct 
description from JOB WHERE company = ’HP’ and platform = ’UNIX’. 
The main challenge is to agree on a perimeter of NL questions within which 
an algorithm is required to find a correct translation, and to reliably detect 
when this is not possible. 
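
To make the target of the translation concrete, the following sketch runs the SQL from the example against an in-memory SQLite table. The rows are invented for illustration, and the helper name `hp_unix_jobs` is hypothetical; the hard part of an NLI, mapping the question to the query, is not shown.

```python
import sqlite3

# Build a tiny JOB table in memory. The rows are made-up sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job (description TEXT, platform TEXT, company TEXT)")
conn.executemany(
    "INSERT INTO job VALUES (?, ?, ?)",
    [
        ("kernel developer", "UNIX", "HP"),
        ("kernel developer", "UNIX", "HP"),  # duplicate, removed by DISTINCT
        ("driver tester", "Windows", "HP"),
        ("sysadmin", "UNIX", "IBM"),
    ],
)


def hp_unix_jobs(connection):
    """Run the SQL that the NL query 'What are the HP jobs on a UNIX system?'
    is assumed to translate to, with the constants passed as parameters."""
    rows = connection.execute(
        "SELECT DISTINCT description FROM job "
        "WHERE company = ? AND platform = ?",
        ("HP", "UNIX"),
    ).fetchall()
    return [row[0] for row in rows]


result = hp_unix_jobs(conn)
```

Passing 'HP' and 'UNIX' as bound parameters rather than splicing them into the string is a design choice of this sketch, not something the text prescribes.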

3.5 Searching Entity-Relationship Graphs 

NLI systems take advantage of the precise schema information available with 
the “corpus” as well as the well-formed nature of the query, even if it is framed 
in uncontrolled natural language. The output of IE systems has less elaborate 
type information, the relations are shallower, and the questions are most often 
a small set of keywords, from users who are used to Web search and do not wish 
to learn about any schema information in framing their queries. 

Free-form keyword search in ER graphs raises many interesting issues, in- 
cluding the query language, the definition of a “document” in response to a 
query, how to score a document which may be distributed in the graph, and how 
to search for these subgraphs efficiently. 

Multiple words in a query may not all match within a single row in a sin- 
gle table, because ER graphs are typically highly normalized using foreign key 
constraints. In an ER version of DBLP, paper titles and author names are in 
different tables, connected by a relation wrote (author, paper). In such cases, 
what is the appropriate unit of response? Recent systems [6,3,17] adopt the view 
that the response should be some minimal graph that connects at least one node 
containing each query keyword. 
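
One simple way to approximate such a connecting graph is sketched below. It is a heuristic of my own choosing for illustration, not the algorithm of the cited systems: run a breadth-first search from the nodes matching each keyword, pick the root with the smallest total distance to all keywords, and return the union of the shortest paths. The result connects the keywords but is not guaranteed to be a minimum tree. All node names are hypothetical.

```python
from collections import deque


def bfs_distances(graph, sources):
    """Multi-source BFS; returns hop distances and parent pointers."""
    dist = {s: 0 for s in sources}
    parent = {s: None for s in sources}
    queue = deque(sources)
    while queue:
        u = queue.popleft()
        for v in graph.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                parent[v] = u
                queue.append(v)
    return dist, parent


def connecting_tree(graph, keyword_nodes):
    """keyword_nodes maps each keyword to the set of nodes containing it.
    Returns (root, edges) or None if no node reaches every keyword."""
    searches = [bfs_distances(graph, sorted(nodes))
                for nodes in keyword_nodes.values()]
    candidates = [n for n in graph if all(n in dist for dist, _ in searches)]
    if not candidates:
        return None
    # Root with the smallest summed distance; ties broken by node name.
    root = min(candidates,
               key=lambda n: (sum(dist[n] for dist, _ in searches), n))
    edges = set()
    for dist, parent in searches:
        node = root
        while parent[node] is not None:  # walk back to a keyword node
            edges.add(frozenset((node, parent[node])))
            node = parent[node]
    return root, edges


# Toy undirected ER graph: authors linked to the papers they wrote.
graph = {
    "author:A": ["paper:P1"],
    "author:B": ["paper:P1", "paper:P2"],
    "paper:P1": ["author:A", "author:B"],
    "paper:P2": ["author:B"],
}
keyword_nodes = {"alice": {"author:A"}, "indexing": {"paper:P1", "paper:P2"}}
root, edges = connecting_tree(graph, keyword_nodes)
```

Here the answer is the single edge joining author:A to paper:P1, the closest paper matching the second keyword.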

Apart from type-free keyword queries, one may look for a single node of a 
specified type (say, a paper) with high proximity to nodes satisfying various 
predicates, e.g., keyword match (“indexing”, “SIGIR”) or conditions on numeric 
fields (year<1995). Resetting random walks [5] are a simple way to answer such 
queries. These techniques are broadly similar to Pagerank [9], except that the 
random surfer teleports only to nodes that satisfy the predicates. Biased random 
walks with restarts are also related to effective conductance in resistive networks. 
In a large ER graph, it is also nontrivial to explain to the user why/how entities 
are related; this is important for diagnostics and eliciting user confidence. 
Conductance-based approaches work well [14]: we can connect +1 V to one node, 
ground the other, penalize high-fanout nodes using a grounded sink connected 
to every node, and report subgraphs that conduct the largest current out of the 
source node. 
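
The resetting random walk mentioned above can be sketched by power iteration. This is a minimal illustration under stated assumptions: the graph is a plain adjacency dictionary, the restart probability of 0.15 and the iteration count are arbitrary choices, and the function name is hypothetical. The only difference from ordinary Pagerank in this sketch is that the walker restarts only at the nodes satisfying the predicates.

```python
def random_walk_with_restart(graph, restart_nodes, restart_prob=0.15,
                             iterations=100):
    """Approximate the stationary distribution of a walk that, at each step,
    jumps back to a uniformly chosen restart node with probability
    restart_prob and otherwise follows a random outgoing edge."""
    restart_nodes = set(restart_nodes)
    nodes = list(graph)
    restart = {n: (1.0 / len(restart_nodes) if n in restart_nodes else 0.0)
               for n in nodes}
    score = dict(restart)
    for _ in range(iterations):
        nxt = {n: restart_prob * restart[n] for n in nodes}
        for u in nodes:
            out = graph[u]
            if out:
                share = (1.0 - restart_prob) * score[u] / len(out)
                for v in out:
                    nxt[v] += share
            else:
                # Dangling node: return its mass to the restart set.
                for n in restart_nodes:
                    nxt[n] += ((1.0 - restart_prob) * score[u]
                               / len(restart_nodes))
        score = nxt
    return score


# Toy undirected path a - b - c - d; only node "a" satisfies the predicate.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
scores = random_walk_with_restart(path, {"a"})
```

Nodes close to the restart set accumulate more probability mass, so ranking nodes of the desired type by score gives a proximity-based answer list.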

Recent years have seen an explosion of analysis and search systems for ER 
graphs, and I expect the important issues regarding meaningfulness of results 
and system scalability to be resolved in the next few years. 

3.6 Open-Domain Question Answering 

Finally, the Web at large will continue to be an “open-domain” system where 
comprehensive and accurate entity and relation extraction will remain elusive. 
No schema of entities and relationships can be complete at any time, even if 
they become more comprehensive over time. Moreover, even a cooperative user 
will not be able to remember and exploit a universal “type system” in asking 
questions. Instead, search systems will provide some basic set of roles [15] that 
apply broadly. Questions will express roles or refinements of roles, and will be 
matched to probabilistic role annotations in the corpus. 

In open-domain QA, question analysis and response scoring will necessarily 
be far more tentative. Some basic machine learning will reveal that the question 
When was television invented? expects the type of the answer (atype) to be a 
date, and that the answer is almost certainly only a few tokens from the word 
television or its synonym. In effect, current technology [22,2,12,23] can translate 
questions into the general form 

find x from corpus where x InstanceOf Atype(question) 
and x RelatedTo GroundConstants(question) 

Here Atype(question) represents the concept of time, and we are looking for a 
reference to an entity x which is an instance of time. (This is where a system like 
KnowItAll comes into play.) In the example above, television or TV would 
be in GroundConstants(question). 

Checking the predicate RelatedTo is next to impossible in general. QA systems 
employ a variety of approximations. These may be as crude as linear proximity 
(the number of tokens separating x from GroundConstants(question)). 
Linear proximity is already surprisingly effective [23]. More sophisticated sys- 
tems 2 attempt a parse of the question and the passage, and verify that x and 
GroundConstants (question) are related in a way specified by (a parse of) the 
question. As might be expected, there is a trade-off between speed and robustness 
on one hand and accuracy and brittleness on the other. 
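
A minimal sketch of the linear-proximity approximation follows. Everything in it is an assumption made for illustration: the regular-expression tokenizer, the four-digit-year test standing in for an answer-type check, and the toy passage. A real system would obtain the answer type from a learned classifier rather than a hand-written rule.

```python
import re


def tokenize(text):
    """Lower-case word tokens; a deliberately crude tokenizer."""
    return re.findall(r"\w+", text.lower())


def linear_proximity(tokens, candidate_index, ground_constants):
    """Number of token positions between the candidate and the nearest
    ground constant, or None if no ground constant occurs."""
    positions = [i for i, t in enumerate(tokens) if t in ground_constants]
    if not positions:
        return None
    return min(abs(i - candidate_index) for i in positions)


def best_candidate(passage, is_answer_type, ground_constants):
    """Return the token of the right type closest to a ground constant."""
    tokens = tokenize(passage)
    scored = []
    for i, tok in enumerate(tokens):
        if is_answer_type(tok):
            d = linear_proximity(tokens, i, ground_constants)
            if d is not None:
                scored.append((d, i, tok))
    return min(scored)[2] if scored else None


def looks_like_year(token):
    """Hypothetical stand-in for checking x InstanceOf Atype(question)."""
    return token.isdigit() and len(token) == 4


# Toy passage written for this example.
passage = ("The telephone was patented in 1876 by Bell. Many decades later, "
           "television was demonstrated in 1926.")
answer = best_candidate(passage, looks_like_year, {"television", "tv"})
```

The sketch picks 1926 because it lies four tokens from "television", while 1876 lies six tokens away. A slightly different wording of the passage could easily reverse the outcome, which illustrates how crude the approximation is.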

4 Conclusion 

Many of the pieces required for better searching are coming together. Current 
and upcoming research will introduce synergy as well as build large, robust applications. 
The applications will need to embrace bootstrapping and life-long 
learning better than before. The architecture must isolate feature extraction, 
models, and algorithms for estimation and inferencing. The interplay between 
processing stages makes this goal very challenging. The applications must be 
able to share models and parameters across different tasks and across time. 

Machine learning and data mining systems have achieved many impressive successes, 
but to become truly widespread they must be able to work with less help 
from people. This requires automating the data cleaning and integration process, 
handling multiple types of objects and relations at once, and easily incorporat- 
ing domain knowledge. In this talk, I describe how we are pursuing these aims 
using Markov logic networks, a representation that combines first-order logic 
and probabilistic graphical models. Data from multiple sources is integrated by 
automatically learning mappings between the objects and terms in them. Rich 
relational structure is learned using a combination of ILP and statistical tech- 
niques. Knowledge is incorporated by viewing logic statements as soft constraints 
on the models to be learned. Application to a real-world university domain shows 
our approach to be accurate, efficient, and less labor-intensive than traditional 
ones. 

This work, joint with Parag and Matthew Richardson, is described in further 
detail in Richardson and Domingos [1], Richardson and Domingos [2], and Parag 
and Domingos [3]. 

Abstract. The scientific analysis of data is only around a century old. For most 
of that century, data analysis was the realm of only one discipline - statistics. 
As a consequence of the development of the computer, things have changed 
dramatically and now there are several such disciplines, including machine 
learning, pattern recognition, and data mining. This paper looks at some of the 
similarities and some of the differences between these disciplines, noting where 
they intersect and, perhaps of more interest, where they do not. Particular issues 
examined include the nature of the data with which they are concerned, the role 
of mathematics, differences in the objectives, how the different areas of appli- 
cation have led to different aims, and how the different disciplines have led 
sometimes to the same analytic tools being developed, but also sometimes to 
different tools being developed. Some conjectures about likely future develop- 
ments are given. 



1 Introduction 

This paper gives a personal view of the state of affairs in data analysis. That means 
that inevitably I will be making general statements, so that most of you will be able to 
disagree on some details. But I am trying to paint a broad picture, and I hope that you 
will agree with the overall picture. 

We live in very exciting times. In fact, from the perspective of a professional data 
analyst, I would say we live in the most exciting of times. Not so long ago, analysing 
data was characterised by drudgery, by manual arithmetic, and the need to take great 
care over numerical trivia. Nowadays, all that has been swept aside, with the burden 
of tedium having been taken over by the computer. What we are left with are the 
high-level interpretations and strategic decisions; we look at the summary values 
derived by the computers and make our statements and draw conclusions and base 
our actions on these. It is clear from this that the computer has become the essential 
tool for data analysis. 

But there is more. The computer has not merely swept aside the tedium. The awe- 
some speed of numerical manipulation has permitted the development of entirely new 
kinds of data analytic tools, being applied in entirely new ways, to entirely new kinds, 
and indeed sizes, of data sets. The computer has given us new ways to look at things. 
The old image, that data analysis was the realm of the boring obsessive, is now so 
diametrically opposite to the new truth as to be laughable. 

This paper describes some of the history, some of the tools, and something of how 
I see the present status of data analysis. So perhaps I should begin with a definition. 
Data analysis is the science of discovery in data, and of processing data to extract 
evidence so that one can make properly informed decisions. In brief, data analysis is 
applied philosophy of science: the theory and methods, not of any particular scientific 
discipline itself, but of how to find things out. 



2 The Evolution of Data Analytic Disciplines 

The origins of data analysis can be traced as far back as one likes. Think of 
Kepler and Gauss analysing astronomical data, of Florence Nightingale using plots to 
demonstrate that soldiers were dying because of poor hygiene rather than military 
action, of Quetelet’s development of ‘social mechanics’, and the fact that the world’s 
oldest statistical society, the Royal Statistical Society, was established in 1834. But 
these ‘origins’ really only represent the initial stirrings: it wasn’t until the start of the 
20th century that a proper scientific discipline of data analysis really began to be 
formed. That discipline was statistics, and for the first half of the 20th century statis- 
tics was the only data analytic game in town. Until around 1950, statistics was the 
science of data analysis. (You will have to permit me some poetic leeway in my 
choice of dates: 1960 might be more realistic.) 

Then, around the middle of the 20th century, the computer arrived and a revolution 
began. Statistics began to change rapidly in response to the awesome possibilities the 
computer provided. There is no doubt that, had statistics been born now, at the start of 
the 21st century, rather than 100 years ago at the start of the 20th, it would be a very 
different kind of animal. (Would we have the f-test?) Moreover, although statistics 
was the intellectual owner of data analysis up until about 1950, it was never the intel- 
lectual owner of data per se, and in the following decades other changes occurred 
which were to challenge the position assumed by statistics. In particular, another 
discipline grew up, whose primary responsibility was, initially, the storage and ma- 
nipulation of data. From data manipulation to data analysis was then hardly a large 
step. Statistics was no longer the only player. 

Nowadays, of course, computer science has grown into a vast edifice, and different 
subdisciplines of it have developed as specialised areas of data analysis, all overlap- 
ping with each other and overlapping with their intellectual parent, statistics. These 
subdisciplines include machine learning, pattern recognition, and data mining, and 
one could arguably include image processing, neural networks, and perhaps even 
computational learning theory and other areas also. I cannot avoid remarking that 
Charles Babbage, typically regarded as one of the fathers of computing with his ana- 
lytical engine, would have been fascinated by these developments: he was also one of 
the founders of the Royal Statistical Society. Of course, these various data analytic 
disciplines are not carbon copies of each other. They have subtly different aims and 
emphases, and often deal with rather different kinds of data (e.g. in terms of data set 
size, correlations, complexities, etc.). One of my aims in this talk is to examine some 
of these differences. Moreover, if the computer has been the strongest influence 
leading to the development of new data analytic technologies, application areas have 
always had, and continue to have, a similar effect. Thus we have psychometrics, 
bioinformatics, chemometrics, technometrics, and other areas, all addressing the same 
sorts of problems, but in different areas. I shall say more about this below. 



3 Data 

I toyed briefly with the idea of calling this talk ‘analysing tomorrow’s data’ since one 
of the striking things about the modern world of data analysis is that the data with 
which we now have to deal could not have been imagined 100 years ago. Then the 
data had to be painstakingly collected by hand since there was no alternative, but 
nowadays much data acquisition is automatic. This has various consequences. 

Firstly, astronomically vast data sets are readily acquired. Books on data mining 
(e.g. [2], [3]), which is that particular data analytic discipline especially concerned 
with analysing large data sets, illustrate the sorts of sizes which are now being en- 
countered. The word terabyte is no longer unusual. When I was taught statistical data 
analysis, I was taught that first one must familiarise oneself with one’s data: plot it 
this way and that, look for outliers and anomalies, fit simple models and examine 
diagnostics. With a billion data points (one of the banking data sets I was presented 
with) this is clearly infeasible. Other problems involve huge numbers of variables, 
and perhaps relatively few data points, posing complex theoretical as well as practical 
questions: bioinformatics, genomics, and proteomics are important sources of such 
problems. 

Secondly, one might have thought that automatic data acquisition would mean better 
data quality, since there would be no subjective human intervention. Unfortunately, 
this has not turned out to be the case. New ways of collecting data have meant 
new ways for the data collection process to go wrong. Worse, large data sets can 
make it more difficult to detect many of the data anomalies. 

Data can be of low quality in many ways: individual values may be distorted or ab- 
sent, entire records may be missing, measurement error may be large, and so on. As 
discussed below, much of statistics is concerned with inference - with making state- 
ments about objects or values not seen or measured, on the basis of those which have 
been. Thus we might want to make statements about other objects from a population, 
or about the future behaviour of objects. Accurate inferences can only be made if one 
has accurate information on how the data were collected. Statisticians have therefore 
predicated their analyses on the assumption that the available observations were 
drawn in well-specified ways, or that the departures from these ways were understood 
and could be modelled. Unfortunately, with many of today’s data sets, such assump- 
tions often cannot be made. This has sometimes made statisticians (quite properly) 
wary of analysing such data. But the data still have to be analysed: the questions still 
need answers. This is one reason why data mining has been so successful, at least at 
first glance. Data miners have been prepared to examine distorted data, and to attempt 
to draw conclusions about it. It has to be said, however, that often that willingness has 
arisen from a position of ignorance, rather than one of awareness of the risks that 
were being taken. Relatively few reports of the conclusions extracted from a data 
mining exercise, for example, qualify those conclusions with a discussion of the pos- 
sible impact of selectivity bias on the data being analysed. This is interesting because, 
almost by definition, data mining is secondary data analysis: the analysis of data col- 
lected for some other purpose. The data may be of perfect quality for its original pur- 
pose (e.g. calculating your grocery bill in the store), but of poor quality for subse- 
quent mining (e.g. because some items were grouped together in the bill). 

A third difference between many modern data analysis problems and those of the 
past is that nowadays they are often dynamic. Electronics permit data to be collected 
as things happen, and this opens the possibility of making decisions as the data are 
collected. An example is in commercial transactions, where a customer can supply 
information and expects an immediate decision. In such circumstances one does not 
have the luxury of taking the data back to one’s laboratory and analysing it at leisure. 
Speech recognition is another example. This issue has led to new kinds of analytic 
tools, with an emphasis on speed and not merely accuracy. No particular area of data 
analysis seems to have precedence for such problems, but the computer science side, 
perhaps especially machine learning, clearly regards such problems as important. 

Although every kind of data analytic discipline must contend with all kinds of 
data, there is no doubt that different kinds are more familiar in different areas. Com- 
putational areas probably place more emphasis on categorical data than on continuous 
data, and this is reflected in the types of data analytic tools (e.g. methods for extract- 
ing association rules) which have been developed. 

4 The Role of Mathematics 

Modern statistics is often regarded as a branch of mathematics. This is entirely inap- 
propriate. Indeed, the qualitative change induced by the advent of the computer 
means that statistics could equally be regarded as a branch of computer science. 

In a sense statistics, and data analysis more generally, is the opposite of mathemat- 
ics. Mathematics begins with assumptions about the structure of the universe of dis- 
course (the axioms) and seeks to deduce the consequences. Data analysis, on the other 
hand, begins with observations of the consequences (the data) and seeks to infer 
something about the structure of the universe. One consequence of this is that one can 
be a good mathematician without understanding anything about any area to which the 
mathematics will be applied - one primarily needs facility with mathematical symbol 
manipulation - but one cannot be a good statistician without being able to relate the 
analysis to the world from which the data arose. This is why one hears of mathemat- 
ics prodigies, but never statistics prodigies. Analysis requires understanding. 

There are other differences as well. Nowadays a computer is an essential and in- 
dispensable tool for statistics, but one can still do much mathematics without a com- 
puter. This is brought home to our undergraduate students, taking mathematics de- 
grees, with substantial statistical components, when they come to use software: 
statistical software packages such as Splus, R, SAS, SPSS, Stata, etc., are very differ- 
ent from mathematical packages such as Maple and Mathematica. Carrying out even 
fairly basic statistical analyses using the latter can be a non-trivial exercise. 

David Finney has commented that it is no more true to describe statistics as a 
branch of mathematics than it would be to describe engineering as a branch of mathe- 
matics, and John Nelder has said ‘ The main danger, I believe, in allowing the ethos of 
mathematics to gain too much influence in statistics is that statisticians will be 
tempted into types of abstraction that they believe will be thought respectable by 
mathematicians rather than pursuing ideas of value to statistics.’ 

There is no doubt that the misconception of statistics as mathematics has been det- 
rimental in the past, especially in commercial and business applications. Data mining, 
in particular, took advantage of this - its very name spells glamour and excitement, the 
possibility of gaining a market edge for free. But there are also other examples where 
the image of statistics slowed its uptake. For example, experimental design (that 
branch of statistics concerned with efficient and cost effective ways to collect data) 
was used in only relatively few sectors (mostly manufacturing). Reformulations of 
experimental design ideas under names such as the Taguchi method and Six Sigma, 
however, have had a big impact. If anything ought to convince my academic col- 
leagues of the power of packaging and presentation, then it should be these examples. 

5 Several Cultures Separated by a Common Language 

The writer George Bernard Shaw once described England and America as ‘two cul- 
tures divided by a common language’, and I sometimes feel that the same applies to 
the various data analytic disciplines. Over the years, I have seen several intense de- 
bates between proponents of the different disciplines. Part of the reason for this lies in 
the different philosophical approaches to investigation. Statistics, perhaps because of 
its mathematical links, places a premium on proof and mathematical demonstration of 
the properties of data analytic tools. For example, demonstrating mathematically that 
an algorithm will always converge. Machine learning, on the other hand, places more 
emphasis on empirical testing. Of course there is overlap. Most methodological statis- 
tics papers include at least one example of the methods applied to real problems, and 
most machine learning papers describe the ideas in mathematical terms, but there is a 
clear difference in what is regarded as of central importance. 

Another reason for the debates has been that many of the ideas were developed in 
parallel, by researchers naturally keen to establish their priority and reputation. This 
led to claims to the effect that ‘we developed it first’ or ‘we demonstrated that prop- 
erty years ago.’ This was certainly evident in the debates on recursive partitioning 
tree classifiers, which were developed in parallel by the machine learning and statis- 
tics communities. 

Misunderstandings can also arise because different schools place emphasis on dif- 
ferent things. Early computer science perspectives on data mining stressed the finding 
of patterns in databases. This is perfectly natural: it is something often required (e.g. 
what percentage of my employees earn more than €x p.a.?). However, this is of lim- 
ited interest to a statistician, who will normally want to make an inference to a wider 
population or to the future (e.g. what percentage of my employees are likely to earn 
more than €x p.a. next year?). Much work on association analysis has ignored this 
inferential aspect. Moreover, much work has also made a false causal assumption: 
while it is interesting to know that ten times as many people who bought A also 
bought B, it is valuable to know that if people can be induced to buy A they will also 
buy B, and the two are not the same. 

While there have been tensions between the different areas when they develop 
similar models, each from their own perspective, there is no doubt that these tensions 
can be immensely beneficial from a scientific perspective. A nice example of this is 
the work on feedforward neural networks. These originally came from the computer 
(or, one might argue, the cybernetics, electrical engineering, or even biological) side 
of things. The perspective of a set of fairly simple interacting processors dominated. 
Later, however, statisticians became involved and translated the ideas into mathe- 
matical terms: such models can be written as nested sequences of nonlinear transfor- 
mations of linear combinations of variables. Once written in fairly standard terms, 
one can apply the statistical results of a century of theoretical work. In particular, one 
could explain that the early neural network claims of very substantial improvement in 
predictive power were likely to be in large part due to overfitting the design data, and 
to present ideas and tools for avoiding this problem. Of course, nowadays all these are 
well understood by the neural network community, but this was certainly not the case 
in the early days (I can remember papers presenting absurdly overoptimistic claims), 
even though statisticians had known about the issues for decades. 

If the computer is leading to a unification of the data analytic schools, so also are 
some theoretical developments. The prime examples here, of course, are Bayesian 
ideas. Bayes’s theorem tells us how we should update our knowledge in the light of 
new information. This is the very essence of learning, so it is not surprising that ma- 
chine learning uses these ideas. With the advent of practical computational tools for 
evaluating high dimensional integrals, such as MCMC, statistics has also undergone a 
dramatic Bayesian revolution, not only in terms of dynamic updating models but also 
in terms of model averaging. Indeed, model averaging, like the understanding of 
overfitting (indeed, closely connected to it), has led to deep theoretical advances. 
Tools such as boosting and bagging are based on these sorts of principles. Boosting, 
in particular, is interesting from our perspective because it illustrates the potential 
synergy which can arise from the disparate emphases of the different disciplines. 
Originally developed by the machine learning community, who proposed it on fairly 
intuitive grounds and showed that it worked in practical applications, it was then 
explored theoretically by statisticians, who showed its strong links to generalised 
additive models, a well-understood class of statistical tools. The most recent tool to 
experience this initial development, followed by an exposure to the ideas and view- 
points of other data analytic disciplines, is that of support vector machines. 

In fact, perceptrons (the progenitor of support vector machines) and logistic dis- 
crimination provide a very nice illustration of the difference in emphasis between, in 
this case, statistical and machine learning models for classification. Logistic discrimi- 
nation fits a model to the probability that an object with given features x will belong 
to class 0 rather than class 1. Typically, the model is fitted by finding the parameters 
which maximise the design set log likelihood: 

\[ \log L \propto \sum_{i=1}^{n} \log \hat{p}(0 \mid x_i) \tag{1} \] 

Classification is then effected by comparing an estimated probability with a threshold. 
It is immediately clear from (1) that all design set data points contribute - it is really 
an average of contributions. This is fine if one’s model $\hat{p}(0 \mid x)$ has the form of the 
‘true’ function $p(0 \mid x)$. But this is a brave assumption. It is likely that the model is 
not perfect. If so, one must question the wisdom of letting data points with estimated 
probability far from the classification threshold contribute the same amount to the fit 
criterion (1) as do those near to it (see [4]). In contrast, perceptron models focus at- 
tention on whether or not the design set points are correctly classified: quality of fit of 
a model far from the decision surface, which is broadly irrelevant to classification 
performance, does not come into it. 
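
The contrast can be illustrated numerically with a small sketch. It rests on several assumptions of mine: a one-dimensional logistic model with fixed, hand-picked parameters, the usual two-class form of the log likelihood, and a plain misclassification count standing in for a perceptron-style criterion. The data are invented.

```python
import math


def prob_class0(x, w, b):
    """Modelled probability of class 0 for a one-dimensional logistic model."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))


def log_likelihood(data, w, b):
    """Sum of log probabilities of the observed classes; every point
    contributes, however far it lies from the decision threshold."""
    total = 0.0
    for x, label in data:
        p0 = prob_class0(x, w, b)
        total += math.log(p0 if label == 0 else 1.0 - p0)
    return total


def misclassified(data, w, b, threshold=0.5):
    """Count of wrongly classified points; insensitive to how far a
    correctly classified point lies from the threshold."""
    errors = 0
    for x, label in data:
        predicted = 0 if prob_class0(x, w, b) >= threshold else 1
        errors += predicted != label
    return errors


w, b = 1.0, 0.0  # hand-picked parameters, not fitted
data = [(2.0, 0), (0.5, 0), (-0.5, 1), (-2.0, 1)]
# Same data, but one already well-classified point moved further away.
far_moved = [(8.0, 0), (0.5, 0), (-0.5, 1), (-2.0, 1)]

ll_before, ll_after = log_likelihood(data, w, b), log_likelihood(far_moved, w, b)
err_before, err_after = misclassified(data, w, b), misclassified(far_moved, w, b)
```

Moving a point that is already far on the correct side changes the log likelihood but leaves the error count untouched, which is the difference in emphasis described above.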

An example of another area which has been developed in rather different ways by 
different disciplines is the area I call pattern discovery. This is the search for, identifi- 
cation of, and description of anomalously high local densities of data points. The 
computer science literature has focused on algorithms for finding such configurations. 
In particular, a great deal of work has occurred when the data are character strings, 
especially in text search (e.g. web search engines) and nucleotide sequences. In contrast, 
the statistical work has concentrated on the inference problem, developing scan statistics 
for deciding whether a local anomaly represents a real underlying structure or is just 
random variation of a background model. Ideas of this kind have been developed in 
many application areas, including bioinformatics, technical stock chart analysis, as- 
tronomy, market basket analysis, and others, but the realisation that they are all tack- 
ling very similar problems appears to be only recent. 

Implicit in the last two paragraphs is one of the fundamental differences in empha- 
sis between computational and statistical approaches to data analysis - again an un- 
derstandable difference in view of their origins. This is the emphasis of the computa- 
tional approaches on algorithms (e.g. the perceptron error-correcting algorithm) and 
the emphasis of the statistical approaches on models (e.g. the logistic discrimination 
model). Both algorithms and models are, of course, important when tackling real 
problems. 

It is my own personal view that one can also characterise the difference between 
the two perspectives, at least to some extent, in terms of degree of risk. The computa- 
tional schools seem often prepared to try something without the assurance that it will 
work, or that it will always work, but in the hope (or knowledge from previous analy- 
ses) that it will sometimes work. The statistical schools seem more risk averse, requir- 
ing more assurance before carrying out an analysis. Perhaps this is illustrated by the 
approaches to pattern discovery mentioned above: the data mining community devel- 
ops algorithms with which to detect possible patterns, while the statistical community 
develops tools to tell whether they are real or merely chance. Once again, both per- 
spectives are valuable, especially in tandem: adventurous risk-taking offers the possi- 
bility of major breakthroughs, while careful analysis shows one that the method gives 
reliable results. 

6 Future Tools and Application Areas? 

Of course, the various data analytic disciplines are constantly evolving. We live in 
very exciting times because of the tools which have been developed over the past few 
decades, but that development has not stopped. If anything, it has accelerated and will 
continue to do so as the computational infrastructure continues to develop. This 
means faster and larger (in terms of all dimensions of datasets). Judging from the 
past, this will translate into analytic tools about which one previously could only have 
dreamt, and, further, into tools one could not even have imagined. 

If the computer is one force driving the development of new data analytic tools, I 
can see at least two others. 

The first of these are application areas, mentioned above. Certainly, the growth of 
statistics over the 20th century was strongly directed by the applications. Thus agri- 
cultural requirements led to the early development of experimental design, psychol- 
ogy motivated the development of factor analysis and other latent variable models, 
medicine led to survival analysis, and so on. In other areas, speech recognition stimu- 
lated work on hidden Markov models, robotics stimulated work on reinforcement 
learning, etc. Of course, once developments have been started, and the power of the 
tools being developed has been recognised, other application areas rapidly adopt the 
tools. 

As with the impact of developing computational infrastructure, I see no reason to 
expect this influence of application areas to stop. We are currently witnessing the 
particular requirements of genomic, proteomic, and related data leading to the devel- 
opment of new analytic tools; for example, methods for handling fat data - data in- 
volving many (perhaps tens of thousands of) variables, but few (perhaps a few tens 
of) data points. Mathematical finance is likewise an area which is shifting its centre of 
gravity towards analysis. Until recently characterised by mathematical areas such as 
stochastic calculus, it is increasingly recognised that data analysis is also needed - the 
values of the model parameters must come from somewhere. More generally, the area 
of personal finance is beginning to provide a rich source of novel problems, requiring 
novel solutions. The world wide web, of course, is another source of new types of 
data, and new problems. This area, in particular, is a source of data which is charac- 
terised by its dynamic properties, and I expect the analysis of dynamic data to play an 
even more crucial role in future developments. Decisions in telecoms systems, even 
in day-to-day purchasing transactions, are needed now, not after a leisurely three 
months’ analysis of a customer’s track record and characteristics. Delay loses busi- 
ness. 

The second additional driving force I can see is also not really a new one. It has 
always been with us, but it will lead to the development of new kinds of tools, in 
response to new demands and also enabled by the advancing computational infra- 
structure. This is the need to model finer and finer aspects of the presenting problems. 
A recent example of this is in the analysis of repeated measures data. The last two 
decades have witnessed a very exciting efflorescence of ideas for tackling such data. 
The essential problem is to recognise and take account of the fact that repeated meas- 
urements data are likely to be correlated (with the (multiple) series being too short to 
use time series ideas). Classical assumptions of independence are all very well, but
more accurate models and predictions result when the dependence is modelled.
Another example of such ‘finer aspects’ of the presenting problem, which has typically
been ignored up until now, is the fact that predictive models are likely to be applied to 
data drawn from distributions different from that from which the design data were 
drawn (perhaps a case for dynamic models). There are many other examples. 

There is, however, a cautionary comment to be made in connection with this
driving force. It is easy to go too far. There is little point in developing a method to cope
with some aspect of the data if the inaccuracies induced by that aspect are trivial in 
comparison with those arising from other causes. Data analysis is not a merely 
mathematical exercise of data manipulation. 

If we data analysts live in exciting times, I think it is clear that the future will be 
even more exciting. Looking back on the past it is obvious that the tensions between 
the different data analytic disciplines have, in the end, been beneficial: we can learn 
from the perspectives and emphases of the other approaches. In particular, we should 
learn that the other disciplines can almost certainly shed light on and help each of us 
gain greater understanding of what we are trying to do. We should look for the syner- 
gies, not the antagonisms. 

I’d like to conclude with two quotations. The first is from John Chambers, the 
computational statistician who developed Splus and who won the 1998 ACM Soft- 
ware System Award for that work. He wrote: ‘Greater statistics can be defined sim- 
ply, if loosely, as everything related to learning from data, from the first planning or 
collection to the last presentation or report. Lesser statistics is the body of specifically
statistical methodology that has evolved within the profession - roughly, statistics as 
defined by texts, journals, and doctoral dissertations. Greater statistics tends to be 
inclusive, eclectic with respect to methodology, closely associated with other disci- 
plines, and practiced by many outside of academia and often outside of professional 
statistics. Lesser statistics tends to be exclusive, oriented to mathematical techniques, 
less frequently collaborative with other disciplines, and primarily practiced by
members of university departments of statistics.’ [1]

John has called the discipline of data analysis ‘greater statistics’, but I am sure we 
can all recognise what we do in his description. What we call it is not important. As 
Juliet puts it in Act II, Scene ii of Shakespeare’s Romeo and Juliet:

‘What's in a name? that which we call a rose
By any other name would smell as sweet.’
Abstract. Typical association rules consider only items enumerated in
transactions. Such rules are referred to as positive association rules. Neg- 
ative association rules also consider the same items, but in addition con- 
sider negated items (i.e. absent from transactions). Negative association 
rules are useful in market-basket analysis to identify products that con- 
flict with each other or products that complement each other. They are 
also very convenient for associative classifiers, classifiers that build their 
classification model based on association rules. Many other applications 
would benefit from negative association rules were it not for the expensive
process of discovering them. Indeed, mining for such rules necessitates
the examination of an exponentially large search space. Despite their 
usefulness, and while they were referred to in many publications, very 
few algorithms to mine them have been proposed to date. In this paper 
we propose an algorithm that extends the support-confidence framework 
with a sliding correlation coefficient threshold. In addition to finding con- 
fident positive rules that have a strong correlation, the algorithm discov- 
ers negative association rules with strong negative correlation between 
the antecedents and consequents. 



1 Introduction 

Association rule mining is a data mining task that discovers relationships among 
items in a transactional database. Association rules have been extensively stud- 
ied in the literature for their usefulness in many application domains such as 
recommender systems, diagnosis decision support, telecommunication, and
intrusion detection. The efficient discovery of such rules has been a major focus
in the data mining research community. Since the original Apriori algorithm [1],
there have been a remarkable number of variants and improvements of
association rule mining algorithms [2].

Association rule analysis is the task of discovering association rules that 
occur frequently in a given data set. A typical application of association rule
mining is market basket analysis. In this process, the behaviour of
customers is studied when they buy different products in a store. The
discovery of interesting patterns in this collection of data can lead to important 
marketing and management strategic decisions. For instance, if a customer buys 
bread, what is the probability that he/she buys milk as well? Depending on
the probability of such an association, marketing personnel can develop better 
planning of the shelf space in the store or can base their discount strategies on 
such associations/correlations found in the data. 

All the traditional association rule mining algorithms were developed to find 
positive associations between items. By positive associations we refer to associ- 
ations between items existing in transactions (i.e. items bought). What about 
associations of the type: “customers that buy Coke do not buy Pepsi” or
“customers that buy juice do not buy bottled water”? In addition to the positive
associations, negative associations can provide valuable information in devising
marketing strategies. Interestingly, very few have focused on negative association 
rules due to the difficulty in discovering these rules. 

Although some researchers have pointed out the importance of negative
associations [3], only a few groups of researchers [4], [5], [6] have proposed
algorithms to mine these types of associations. This illustrates not only the novelty
of negative association rules, but also the challenge in discovering them.



1.1 Contributions of This Paper 

The main contributions of this work are as follows: 

1. We devise a new algorithm to generate both positive and negative association
rules. Very few papers discuss the discovery of negative association
rules. Our algorithm differs from the existing ones in that it uses a different
interestingness measure and generates the association rules from a different
candidate set.

2. To avoid adding new parameters that would make tuning difficult and thus 
impractical, we introduce an automatic thresholding on the correlation coef- 
ficient. We automatically and progressively slide the threshold to find strong 
correlations. 

3. We compare our algorithm with other existing algorithms that can generate 
negative association rules and discuss their performances. 

The remainder of the paper is organized as follows: Section 2 gives an overview 
of the basic concepts involved in association rule mining. In Section 3 we
introduce our approach for positive and negative rule generation based on a correlation
measure. Section 4 presents related work for comparison with our approach.
Experimental results are described in Section 5 along with the performance of our
system compared to known algorithms. We summarize our research and discuss 
some future work directions in Section 6. 



2 Basic Concepts and Terminology 

This section introduces association rules terminology and some related work on 
negative association rules. 
2.1 Association Rules

Formally, association rules are defined as follows: Let I = {i_1, i_2, ..., i_n} be a set
of items. Let D be a set of transactions, where each transaction T is a set of
items such that T ⊆ I. Each transaction is associated with a unique identifier
TID. A transaction T is said to contain X, a set of items in I, if X ⊆ T. An
association rule is an implication of the form “X ⇒ Y”, where X ⊆ I, Y ⊆ I,
and X ∩ Y = ∅. The rule X ⇒ Y has a support s in the transaction set D if s% of
the transactions in D contain X ∪ Y. In other words, the support of the rule is the
probability that X and Y hold together among all the possible presented cases.
It is said that the rule X ⇒ Y holds in the transaction set D with confidence
c if c% of transactions in D that contain X also contain Y. In other words, the
confidence of the rule is the conditional probability that the consequent Y is
true under the condition of the antecedent X. The problem of discovering all
association rules from a set of transactions D consists of generating the rules
that have a support and confidence greater than given thresholds. These rules
are called strong rules, and the framework is known as the support-confidence
framework for association rule mining.
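
The two measures defined above can be illustrated with a minimal sketch. The helper names `support` and `confidence` and the four toy baskets are assumptions made for this illustration only; they are not taken from the paper.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support(transactions: List[Transaction], itemset: Iterable[str]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def confidence(transactions: List[Transaction],
               antecedent: Iterable[str],
               consequent: Iterable[str]) -> float:
    """Estimate of P(consequent | antecedent) from the transactions."""
    both = frozenset(antecedent) | frozenset(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical toy baskets (illustrative data, not from the paper).
baskets = [frozenset(t) for t in (
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread"},
    {"milk"},
)]

rule_support = support(baskets, {"bread", "milk"})          # 2 of 4 baskets
rule_confidence = confidence(baskets, {"bread"}, {"milk"})  # 2 of 3 bread baskets
```

A rule such as bread ⇒ milk would then be kept as "strong" only if both values reach the user-given thresholds.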



2.2 Negative Association Rules 

Example 1. Suppose we have an example from market basket data. In this
example we want to study the purchase of organic versus non-organic
vegetables in a grocery store. Table 1 gives us the data collected from 100 baskets in
the store. In Table 1, “organic” means the basket contains organic vegetables and
“¬organic” means the basket does not contain organic vegetables. The same
applies for non-organic. On this data, let us find the positive association rules in the
“support-confidence” framework. The association rule “non-organic → organic”
has 20% support and 25% confidence (supp(non-organic ∧ organic)/supp(non-organic)).
The association rule “organic → non-organic” has 20% support and
50% confidence (supp(non-organic ∧ organic)/supp(organic)). The support is
considered fairly high for both rules. Although we may reject the first rule on
the confidence basis, the second rule seems a valid rule and may be considered
in the data analysis. Now, let us compute the statistical correlation between the
non-organic and organic items. A more elaborate discussion of the correlation
measure is given in Section 3.1. The correlation coefficient between these two
items is -0.61. This means that the two items are negatively correlated. This
measure sheds new light on the data analysis of these specific items. The rule
“organic → non-organic” is misleading. The correlation brings new information
that can help in devising better marketing strategies.

The example above illustrates some weaknesses in the “support-confidence” 
framework and the need for the discovery of more interesting rules. The
interestingness of an association rule can be defined in terms of the measure associated
with it, as well as in terms of the form the association takes.

Table 1. Example 1 data        Table 2. 2x2 contingency table

Brin et al. [3] were the first in the literature to mention the notion of
negative relationships. Their model is chi-square based. They use the statistical
test to verify the independence between two variables. To determine the nature
(positive or negative) of the relationship, a correlation metric was used. In [6]
the authors present a new idea to mine strong negative rules. They combine
positive frequent itemsets with domain knowledge in the form of a taxonomy
to mine negative associations. However, their algorithm is hard to generalize
since it is domain dependent and requires a predefined taxonomy. A similar
approach is described in [7]. Wu et al. [4] derived a new algorithm for generating
both positive and negative association rules. They add on top of the
support-confidence framework another measure called mininterest for a better pruning of
the frequent itemsets generated. In [5] the authors use only negative associations
of the type X → ¬Y to substitute items in market basket analysis.

We define as a generalized negative association rule a rule that contains a
negation of an item (i.e., a rule for which the antecedent or the consequent can
be formed by a conjunction of presence or absence of terms). An example of
such an association would be: A ∧ ¬B ∧ ¬C ∧ D → E ∧ ¬F. To the
best of our knowledge there is no algorithm that can determine this type of
association. Deriving such an algorithm is not an easy problem, since it is well
known that itemset generation in the association rule mining process is
expensive. It would be necessary to consider not only all items in a
transaction, but also all possible items absent from the transaction. There could
be a considerable exponential growth in the candidate generation phase. This is
especially true in datasets with highly correlated attributes. That is why it is not
feasible to extend the attribute space by adding the negated attributes and use
the existing association rule algorithms. Although we are currently investigating
this problem, in this paper we generate a subset of the generalized negative
association rules. We refer to them as confined negative association rules. A
confined negative association rule is one of the following: ¬X → Y, X → ¬Y, or
¬X → ¬Y, where the entire antecedent or consequent must be a conjunction of
negated attributes or a conjunction of non-negated attributes.

3 Discovering Positive and Negative Association Rules 

The most common framework in the association rules generation is the “support- 
confidence” one. Although these two parameters allow the pruning of many 
associations that are discovered in data, there are cases when many uninterest- 
ing rules may be produced. In this paper we consider another framework that 
adds to the support-confidence some measures based on correlation analysis. 
The next section introduces the correlation coefficient, which we add to the
support-confidence framework in this work.
3.1 Correlation Coefficient



The correlation coefficient measures the strength of the linear relationship between a
pair of variables. It is discussed in the context of association patterns in [8].
For two variables X and Y, the correlation coefficient is given by the following
formula:

$$\rho = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y} \tag{1}$$

In Equation 1, Cov(X, Y) represents the covariance of the two variables and
σ_X stands for the standard deviation of X. The range of values for ρ is between -1
and +1. If the two variables are independent then ρ equals 0. When ρ = +1 the
variables considered are perfectly positively correlated. Similarly, when ρ = -1
the variables considered are perfectly negatively correlated. A positive correlation
is evidence of a general tendency that when the value of X increases/decreases so
does the value of Y. A negative correlation occurs when for the increase/decrease
of X value we discover a decrease/increase in the value of Y.

Let X and Y be two binary variables. Table 2 summarizes the information 
about X and Y variables in a dataset in a 2x2 contingency table. The cells of 
this table represent the possible combinations of X and Y and give the frequency 
associated with each combination. N is the size of the dataset considered. 

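Given the values in the contingency table for binary variables, Pearson
introduced the φ correlation coefficient, which is given in Equation 2:

$$\phi = \frac{f_{11} f_{00} - f_{10} f_{01}}{\sqrt{f_{+0} f_{+1} f_{1+} f_{0+}}} \tag{2}$$

We can transform this equation by replacing f_00, f_01, f_10, f_0+ and f_+0 as follows:

$$\phi = \frac{f_{11}(N - f_{10} - f_{01} - f_{11}) - f_{10} f_{01}}{\sqrt{f_{+0} f_{+1} f_{1+} f_{0+}}} \tag{3}$$

$$= \frac{f_{11} N - f_{11} f_{10} - f_{11} f_{01} - f_{11}^2 - f_{10} f_{01}}{\sqrt{f_{+0} f_{+1} f_{1+} f_{0+}}} \tag{4}$$

$$= \frac{f_{11} N - (f_{11} + f_{10})(f_{11} + f_{01})}{\sqrt{f_{+0} f_{+1} f_{1+} f_{0+}}} \tag{5}$$

$$= \frac{N f_{11} - f_{1+} f_{+1}}{\sqrt{f_{1+}(N - f_{1+}) f_{+1}(N - f_{+1})}} \tag{6}$$
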

The measure given in Equation 6 is the measure that we use in the association 
rule generation. 
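
As a check on Equation 6, the following minimal sketch computes the coefficient from the four cells of a 2x2 contingency table. The function name `phi` is an assumption of this sketch, as is the choice to return 0.0 when a variable is constant (the coefficient is undefined there). The cell counts are inferred from the supports and confidences quoted in Example 1 (100 baskets, 20 containing both kinds of vegetables, 40 containing organic, 80 containing non-organic), not read from Table 1 itself.

```python
import math


def phi(f11: int, f10: int, f01: int, f00: int) -> float:
    """Correlation of two binary variables from contingency counts (Equation 6)."""
    n = f11 + f10 + f01 + f00
    f1p = f11 + f10  # row total f_{1+}: X present
    fp1 = f11 + f01  # column total f_{+1}: Y present
    denom = math.sqrt(f1p * (n - f1p) * fp1 * (n - fp1))
    if denom == 0:
        # Convention chosen for this sketch: a constant variable has no
        # defined correlation, so report 0.0.
        return 0.0
    return (n * f11 - f1p * fp1) / denom


# X = organic, Y = non-organic; counts inferred from Example 1.
rho_example = phi(f11=20, f10=20, f01=60, f00=0)

# The same value computed directly from Equation 2.
eq2 = (20 * 0 - 20 * 60) / math.sqrt((20 + 0) * (20 + 60) * (20 + 20) * (60 + 0))
```

Under these inferred counts the value rounds to -0.61, which agrees with the coefficient reported in Example 1.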

Cohen [9] discusses the correlation coefficient and its strength. In his
book, he considers that a correlation of 0.5 is large, 0.3 is moderate, and 0.1
is small. The interpretation of this statement is that anything greater than 0.5
is large, 0.5-0.3 is moderate, 0.3-0.1 is small, and anything smaller than 0.1 is
insubstantial, trivial, or otherwise not worth worrying about, as described in [10].

We use these arguments to introduce an automatic progressive thresholding
process. We start by setting our correlation threshold to 0.5. If no strongly
correlated rules are found, the threshold slides progressively to 0.4, 0.3 and so on
until some rules are found with moderate correlations. This progressive process
eliminates the need for manually adjusted thresholds. It is well known that the
more parameters a user is given, the more difficult it becomes to tune the system.
Association rule mining is certainly not immune to this phenomenon.
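
The progressive thresholding just described can be sketched as a small driver loop. This is an illustration under assumptions: `mine_with_sliding_threshold` and the callback `mine_rules` are hypothetical names, the toy correlation scores are invented, and the rounding step is added here only to keep the repeated subtraction of 0.1 free of floating-point drift.

```python
from typing import Callable, List, Tuple


def mine_with_sliding_threshold(
    mine_rules: Callable[[float], List[str]],
    rho_min: float = 0.5,
    step: float = 0.1,
) -> Tuple[List[str], float]:
    """Lower the correlation threshold until the miner returns some rules."""
    while rho_min > 0:
        rules = mine_rules(rho_min)
        if rules:
            return rules, rho_min
        # Round so that 0.5, 0.4, 0.3, ... stay exact multiples of the step.
        rho_min = round(rho_min - step, 10)
    return [], rho_min


# Hypothetical miner: keeps the rules whose |correlation| reaches the threshold.
toy_scores = {"rule_p": 0.42, "rule_q": -0.35}


def toy_miner(threshold: float) -> List[str]:
    return sorted(r for r, rho in toy_scores.items() if abs(rho) >= threshold)


found_rules, used_threshold = mine_with_sliding_threshold(toy_miner)
```

With these toy scores nothing passes at 0.5, so the threshold slides once and the first rule is accepted at 0.4.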

3.2 Our Algorithm 

Traditionally, the process of mining for association rules has two phases: first,
mining for frequent itemsets; and second, generating strong association rules
from the discovered frequent itemsets. In our algorithm, we combine the two
phases and generate the relevant rules on-the-fly while analyzing the
correlations within each candidate itemset. This avoids evaluating item combinations
redundantly. Indeed, for each generated candidate itemset, we compute all
possible combinations of items to analyze their correlations. At the end, we keep
only those rules generated from item combinations with strong correlation. The
strength of the correlation is indicated by a correlation threshold, either given as
input or set by default to 0.5 (see above for the rationale). If the correlation between
item combinations X and Y of an itemset XY, where X and Y are itemsets,
is negative, negative association rules are generated when their confidence is
high enough. The produced rules have either the antecedent or the consequent
negated (¬X → Y and X → ¬Y), even if the support is not higher than the
support threshold. However, if the correlation is positive, a positive association
rule with the classical support-confidence idea is generated. If the support is not
adequate, a negative association rule that negates both the antecedent and the
consequent is generated when its confidence and support are high.

The algorithm generates all positive and negative association rules that have 
a strong correlation. If no rule is found, either positive or negative, the correlation 
threshold is automatically lowered to ease the constraint on the strength of the 
correlation and the process is redone. Figure 1 gives the detailed pseudo-code 
for our algorithm. 

Initially both sets of negative and positive association rules are set to empty
(line 1). After generating all the frequent 1-itemsets (line 2) we iterate to generate
all frequent k-itemsets, stored in F_k (line 8). F_k is verified from a set of candidates
C_k computed in line 4. The iteration from line 3 stops when no more frequent
itemsets can be generated. Unlike the join made in the traditional Apriori algorithm,
to generate candidates at level k, instead of joining frequent (k - 1)-itemsets with
each other, we join the frequent itemsets at level k - 1 with the frequent 1-itemsets (line 4).
This is because we want to extend the set of candidate itemsets and have the
possibility to analyze the correlation of more item combinations. The rationale
will be explained later. Every candidate itemset generated this way is on one
hand tested for support (line 7), and on the other hand used to analyze possible
correlations even if its support is below the minimum support (loop from line 9 to
22). Correlations for all possible pair combinations for each candidate itemset are
computed. For an itemset i and a pair combination (X, Y) such that i = X ∪ Y,
the correlation coefficient is calculated (line 10). If the correlation is positive and
strong enough, a positive association rule of the type X → Y is generated, if
supp(X ∪ Y) is above the minimum support threshold and the confidence of the
Algorithm: Positive and Negative Association Rules Generation
Input: TD, minsupp, minconf, and ρ_min: respectively the transactional database,
       minimum support, minimum confidence, and correlation threshold.
Output: AR: positive and negative association rules.
Method:
(0)  if ρ_min is undefined then ρ_min ← 0.5
(1)  positiveAR ← ∅; negativeAR ← ∅      /* positive and negative AR sets */
(2)  scan the database and find the set of frequent 1-itemsets (F_1)
(3)  for (k = 2; F_{k-1} ≠ ∅; k++) {
(4)    C_k = F_{k-1} ⋈ F_1
(5)    foreach i ∈ C_k {
(6)      s = support(TD, i)               /* support of itemset i is computed */
(7)      if s ≥ minsupp then
(8)        F_k ← F_k ∪ {i}                /* itemset i is added to F_k */
(9)      foreach X, Y (i = X ∪ Y) {
(10)       ρ = correlation(X, Y)          /* correlation between X and Y is computed */
(11)       if ρ ≥ ρ_min then
(12)         if s ≥ minsupp then
(13)           if confidence(X → Y) ≥ minconf then
(14)             positiveAR ← positiveAR ∪ {X → Y}
(15)         else if confidence(¬X → ¬Y) ≥ minconf and
                     supp(¬X¬Y) ≥ minsupp then
(16)           negativeAR ← negativeAR ∪ {¬X → ¬Y}
(17)       if ρ ≤ -ρ_min then             /* ρ < 0 and |ρ| ≥ ρ_min */
(18)         if confidence(X → ¬Y) ≥ minconf then
(19)           negativeAR ← negativeAR ∪ {X → ¬Y}
(20)         if confidence(¬X → Y) ≥ minconf then
(21)           negativeAR ← negativeAR ∪ {¬X → Y}
(22)     }
(23)   }
(24) }
(25) AR ← positiveAR ∪ negativeAR
(26) if AR = ∅ then {
(27)   ρ_min ← ρ_min - 0.1
(28)   if ρ_min > 0 then go to step (3)
(29) }
(30) return AR

Fig. 1. Discovering positive and negative confined association rules




rule is strong. Otherwise, if we still have a positive and strong correlation but
the support is below the minimum support, a negative association rule of the
type ¬X → ¬Y is generated if its confidence is above the minimum confidence
threshold (lines 15-16). On the other hand, if the correlation test gives a strong
negative correlation, association rules of the types X → ¬Y and ¬X → Y
are generated and appended to the set of association rules if their confidence
is adequate. The result is compiled by combining all discovered positive and
negative association rules. Lines 26 onward illustrate the automatic progressive
thresholding for the correlation coefficient. If no rules are generated at a given
correlation level, the threshold is lowered by 0.1 (line 27) and the process is
reiterated.
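
The procedure of Fig. 1 can be sketched in Python for a single, fixed correlation threshold (lines 1 to 25; the sliding of the threshold in lines 26 to 29 is left out). This is a rough illustration, not the authors' implementation. It rests on several assumptions: all comparisons are read as "greater than or equal", ¬X is read as "every item of X is absent" (following the definition of confined rules as conjunctions of negated attributes), a ratio with a zero denominator is treated as 0, and rules are returned as plain strings. The function and helper names are hypothetical. The data are the ten bit vectors of Table 4.

```python
import math
from itertools import combinations


def mine_confined_rules(transactions, minsupp, minconf, rho_min=0.5):
    """One pass of the Fig. 1 procedure at a fixed correlation threshold."""
    n = len(transactions)

    def count(present=frozenset(), absent=frozenset()):
        # Transactions containing all of `present` and none of `absent`.
        return sum(1 for t in transactions
                   if present <= t and not (absent & t))

    def ratio(num, den):
        return num / den if den else 0.0  # assumption: undefined ratio -> 0

    def correlation(x, y):
        # Equation 6 with f11 = count(X u Y), f1+ = count(X), f+1 = count(Y).
        f11, f1p, fp1 = count(x | y), count(x), count(y)
        den = math.sqrt(f1p * (n - f1p) * fp1 * (n - fp1))
        return ratio(n * f11 - f1p * fp1, den)

    def show(items, negated=False):
        return " ".join(("¬" if negated else "") + a for a in sorted(items))

    positive, negative = set(), set()
    all_items = set().union(*transactions)
    f1 = {frozenset([a]) for a in all_items
          if count(frozenset([a])) / n >= minsupp}
    f_prev = f1
    while f_prev:
        # Line 4: join the previous level with the frequent 1-itemsets.
        candidates = {f | g for f in f_prev for g in f1 if not g <= f}
        f_k = set()
        for i in candidates:
            s = count(i) / n
            if s >= minsupp:
                f_k.add(i)
            # Line 9: every split of i into a non-empty X and Y = i - X.
            for r in range(1, len(i)):
                for combo in combinations(sorted(i), r):
                    x = frozenset(combo)
                    y = i - x
                    rho = correlation(x, y)
                    if rho >= rho_min:
                        if s >= minsupp:
                            if ratio(count(i), count(x)) >= minconf:
                                positive.add(f"{show(x)} → {show(y)}")
                        elif (ratio(count(absent=i), count(absent=x)) >= minconf
                              and count(absent=i) / n >= minsupp):
                            negative.add(f"{show(x, True)} → {show(y, True)}")
                    if rho <= -rho_min:
                        if ratio(count(x, y), count(x)) >= minconf:
                            negative.add(f"{show(x)} → {show(y, True)}")
                        if ratio(count(y, x), count(absent=x)) >= minconf:
                            negative.add(f"{show(x, True)} → {show(y)}")
        f_prev = f_k
    return positive, negative


# Bit vectors of Table 4 (columns A..F).
table4 = ["101100", "011000", "001000", "110001", "101100",
          "000010", "010001", "011001", "110010", "100100"]
td = [frozenset(a for a, bit in zip("ABCDEF", bits) if bit == "1")
      for bits in table4]

positive_rules, negative_rules = mine_confined_rules(td, 0.2, 0.5, 0.5)
```

Every candidate is tested for correlation whether or not it is frequent, which is what allows negative rules to surface from itemsets whose own support is zero.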

4 Related Work in Negative Association Rule Mining 

In this section, we discuss two known algorithms that generate negative associa- 
tion rules. We compare our approach with them later in the experiments section. 







Maria-Luiza Antonie and Osmar R. Zaiane 



4.1 Negative Association Rule Algorithms 

We give a short description of the existing algorithms that can generate positive 
and negative association rules. For more details, please refer to [4] and [5]. 

First, we discuss the algorithm proposed by Wu et al. [4]. They add on top of
the support-confidence framework another measure called mininterest (the
argument is that a rule A → B is of interest only if supp(A ∪ B) - supp(A)supp(B) >
mininterest). The authors consider as itemsets of interest those itemsets
(positive or negative) that exceed minimum support and minimum interest thresholds.
Although [4] introduces the “mininterest” parameter, the authors do not
discuss how to set it and what would be the impact on the results when changing
this parameter. The approach differs from our algorithm in that we use the
correlation coefficient as the measure of interestingness, which has been
thoroughly studied in the statistics community. In addition, the value of our
parameter is well defined and it is not as sensitive to the dataset as the
mininterest parameter. In our algorithm (line 9) we compute the correlation coefficient
for every pair X, Y of an itemset i where i = X ∪ Y. As described earlier, when
such a pair is found to be correlated, an association rule is generated from it. In [4],
they compute the interest for every pair X, Y of the itemset i where i = X ∪ Y.
However, they extract rules from itemset i only if any expression i = X ∪ Y
exceeds the minimum interest threshold. We claim that by adding this condition
they are losing some potentially interesting association rules. In addition, in our
algorithm the candidate set C_k is generated as a join between F_{k-1} and F_1.
In [4] the candidate set C_k is generated as a union of two frequent itemsets in
F_i for 1 ≤ i ≤ k - 1. This turns out to be expensive. Since we all make the
assumption that a k-itemset must have all its subsets in F_{k-1}, we prove in the
next theorem that our join generates the same itemsets as in [4].

Theorem. All candidate itemsets c ∈ C_k generated by F_i ⋈ F_j, 1 ≤ i, j ≤ k - 1,
for which ∃ t ⊂ c such that t ∈ F_{k-1}, can be discovered by F_{k-1} ⋈ F_1.

Proof. Let us suppose ∃ c ∈ C_k such that c ∈ F_i ⋈ F_j, 1 ≤ i, j ≤ k - 1, and
c ∉ F_{k-1} ⋈ F_1. Given the condition stated in the theorem, ∃ t ⊂ c such that t ∈ F_{k-1}.
Since c ∉ F_{k-1} ⋈ F_1 and t ∈ F_{k-1}, it follows that c - t ∉ F_1. This is false, as
c - t is of length one and c ∈ C_k was generated from frequent itemsets. Thus
∀ c ∈ C_k, c ∈ F_{k-1} ⋈ F_1. Q.E.D.
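
The join used for candidate generation can be sketched as follows. The function name `join_with_singletons` is an assumption of this sketch. The sets `f1` and `f2` are the frequent 1- and 2-itemsets of the Example 2 database (Table 4) at a minimum support of 0.2, as counted by hand from its bit vectors.

```python
def join_with_singletons(f_prev, f1):
    """C_k = F_{k-1} join F_1: extend each frequent (k-1)-itemset by one frequent item."""
    return {itemset | single
            for itemset in f_prev
            for single in f1
            if not single <= itemset}


# Frequent itemsets of Table 4 at minimum support 0.2 (counted by hand).
f1 = {frozenset(a) for a in "ABCDEF"}
f2 = {frozenset(p) for p in ("AB", "AC", "AD", "BC", "BF", "CD")}

c3 = join_with_singletons(f2, f1)
```

Because every frequent item may be appended, the candidate set contains itemsets such as BCD whose own support is below the threshold, which is what makes them available for correlation analysis.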

Second, we present the algorithm proposed in [5]. The algorithm is named
SRM (substitution rule mining) by its authors. We refer to it in the same way
throughout the paper. The authors develop an algorithm to discover negative
associations of the type X → ¬Y. These association rules can be used to
discover which items are substitutes for others in market basket analysis. Their
algorithm first discovers what they call concrete items, which are those
itemsets that have a high chi-square value and exceed the expected support. Once
these itemsets are discovered, they compute the correlation coefficient for each
pair of them. From those pairs that are negatively correlated, they extract the
desired rules (of the type X → ¬Y). Although interesting for the
item-substitution application, this approach is limited in the kind of rules that it can discover.
Table 3. TD (a)

TID   Items
1     A, C, D
2     B, C
3     C
4     A, B, F
5     A, C, D
6     E
7     B, F
8     B, C, F
9     A, B, E
10    A, D

Table 4. TD (b)

TID   Items                      Equivalent bit vector
1     A, ¬B, C, D, ¬E, ¬F        (101100)
2     ¬A, B, C, ¬D, ¬E, ¬F       (011000)
3     ¬A, ¬B, C, ¬D, ¬E, ¬F      (001000)
4     A, B, ¬C, ¬D, ¬E, F        (110001)
5     A, ¬B, C, D, ¬E, ¬F        (101100)
6     ¬A, ¬B, ¬C, ¬D, E, ¬F      (000010)
7     ¬A, B, ¬C, ¬D, ¬E, F       (010001)
8     ¬A, B, C, ¬D, ¬E, F        (011001)
9     A, B, ¬C, ¬D, E, ¬F        (110010)
10    A, ¬B, ¬C, D, ¬E, ¬F       (100100)


Using the next example, which is an extension of the example presented in 
[5], we present some of the differences among the three algorithms. 

Example 2. Let us consider a small transactional database with 10 transactions
and 6 items, given in Table 3. To illustrate the
challenges in mining negative association rules we create another transactional
database where, for each transaction, the complement of each missing item is
appended to it. The newly created dataset is shown in Table 4. This new database can
be mined with the existing association rule mining algorithms. However, there
are a few drawbacks to this naive approach. In practice, the data collections
are very large, thus adding all the complemented items to the original database
requires a large storage space. Not only does the storage space increase
considerably, but so do the execution times, in particular when the number of unique
items in the database is very large. In addition, many association rules would be
generated, many of them being of no interest to the applications at hand.
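
The naive transformation from Table 3 to Table 4 can be sketched in a few lines. The function name `complement` and the fixed item order "ABCDEF" are assumptions of this illustration.

```python
ITEMS = "ABCDEF"


def complement(transaction, items=ITEMS):
    """Label every item as present or negated and build the matching bit vector."""
    labelled = [a if a in transaction else "¬" + a for a in items]
    bits = "".join("1" if a in transaction else "0" for a in items)
    return labelled, bits


# Transaction 1 of Table 3.
labelled, bits = complement({"A", "C", "D"})
```

Each transaction grows to one entry per item in the database regardless of how few items it originally held, which is the storage cost mentioned above.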

Using a minimum support of 0.2, the following itemsets are discovered using
the three discussed algorithms. For this example the correlation threshold was
set to 0.5, and the minimum interest to 0.07.



Table 5. 2-itemsets

Correlation   Interest   Concrete
AD            AD         AD
BF            BF         BF
=================================
BD            BD         BD
CE            CE
              DF

Table 6. 3-itemsets

Correlation   Interest   Concrete
ACD                      ACD
=================================
ABC           ABC
ABD                      ABD
BCD


In Table 5 and Table 6, the first column presents the results when our
approach was used. The second column uses the algorithm from [4], while in the
third one the results are obtained using the approach in [5]. In both tables the
positive itemsets are separated from the negative ones by a double horizontal line.
The positive itemsets are in the upper part of the tables. As can be seen, for
the 2-itemsets all three algorithms find the same positive ones. The differences
occur for the negative itemsets. The itemset DF has an interest of 0.09,
but it has a correlation of only 0.42. That is why it is not found by our approach
or by the SRM algorithm [5]. The itemset CE is not found by SRM because their
condition is that the itemset should have a higher correlation than the minimum
value. In our approach the condition is to be greater than or equal to it. Since the itemset
CE has a correlation of 0.5 it is discovered by our algorithm, but not by SRM.
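
The figures quoted for the 2-itemsets can be reproduced from the bit vectors of Table 4 with a short sketch. The helper names are assumptions. The interest is taken here as the absolute value of supp(X ∪ Y) - supp(X)supp(Y), an assumption chosen because it matches the 0.09 reported for DF.

```python
import math

TABLE4 = ["101100", "011000", "001000", "110001", "101100",
          "000010", "010001", "011001", "110010", "100100"]
TD = [frozenset(a for a, bit in zip("ABCDEF", bits) if bit == "1")
      for bits in TABLE4]
N = len(TD)


def count(items):
    """Number of transactions containing every item in `items`."""
    return sum(1 for t in TD if set(items) <= t)


def phi(x, y):
    """Equation 6 for two single items x and y."""
    f11, f1p, fp1 = count(x + y), count(x), count(y)
    return (N * f11 - f1p * fp1) / math.sqrt(f1p * (N - f1p) * fp1 * (N - fp1))


def interest(x, y):
    """Assumed form: |supp(X u Y) - supp(X) supp(Y)|."""
    return abs(count(x + y) / N - (count(x) / N) * (count(y) / N))


scores = {x + y: (phi(x, y), interest(x, y))
          for x, y in ("AD", "BF", "BD", "CE", "DF")}
```

On this data CE sits exactly at -0.5, so whether it is reported depends on the comparison being strict or not, while DF has a magnitude of roughly 0.43 (quoted as 0.42 in the text) and falls below a 0.5 threshold even though its interest exceeds 0.07.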

In Table 6 there are differences for both the positive and the negative
itemsets. The algorithm that uses the minimum interest parameter discovers only
the ABC itemset because it is the only one for which all the pairs X, Y with
ABC = X ∪ Y are above the parameter. Although all the other itemsets discovered
by the other algorithms have at least two strong pairs, they are not considered
of interest. Our approach and SRM generate the same positive 3-itemset. The
itemsets BCD and ABC are not discovered by SRM because none of their subsets of
two items are generated as concrete during the process.

From the itemsets that were shown in Table 5 and Table 6 a set of association
rules can be generated. Here we show some of the rules that were generated from
the itemsets that were discovered by one algorithm, but not by others. From
itemset CE, the association rule ¬E → C can be found with support 0.5 and
confidence of 62%. This rule seems to be strong, but it is missed by the SRM
algorithm. From itemset DF, which is discovered only by the minimum interest
algorithm, the association rules ¬D → F and D → ¬F can be discovered.
However, both rules have support 0.3 and confidence of 42%. These rules would
be eliminated if the confidence threshold were set to 50%, thus our
approach and SRM do not miss much by not generating them. In addition, our
approach generates the 3-itemset BCD. From this itemset the rule B → ¬C¬D
is discovered and it has support of 0.2 and confidence of 60%.
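
The support and confidence of rules with negated items can be computed directly
from the transactions, as in the minimal Python sketch below. The literal
encoding `(item, present)`, the function names and the four toy transactions
are assumptions made for illustration; the numbers do not reproduce the example
database of this section.

```python
def holds(literal, transaction):
    """A literal is (item, True) for presence or (item, False) for absence."""
    item, present = literal
    return (item in transaction) == present


def support(literals, transactions):
    """Fraction of transactions in which every literal holds."""
    hits = sum(all(holds(lit, t) for lit in literals) for t in transactions)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """supp(antecedent and consequent) / supp(antecedent)."""
    supp_antecedent = support(antecedent, transactions)
    if supp_antecedent == 0:
        return 0.0
    return support(antecedent + consequent, transactions) / supp_antecedent


# Hypothetical toy database.
transactions = [{"C"}, {"C"}, {"E"}, {"C", "E"}]

# Rule "not E -> C".
antecedent = [("E", False)]
consequent = [("C", True)]
rule_support = support(antecedent + consequent, transactions)
rule_confidence = confidence(antecedent, consequent, transactions)
print(rule_support, rule_confidence)
```

Counting absences at evaluation time, as done here, avoids materializing the
complemented items in the database.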



5 Experimental Results 

We conducted our experiments on a real dataset to study the behaviour of the
algorithms compared. We used the Reuters-21578 text collection [11]. The Reuters
dataset has 6488 transactions when only the ten largest categories are kept.

We compare the three algorithms discussed in the sections above. For each 
algorithm a set of values for its main interestingness measure was used in the
experiments. Our algorithm and SRM [5] had the correlation coefficient set to 
0.5, 0.4 and 0.3. In [4] the authors used the value 0.07 in their examples. We used 
this value and two others in its vicinity (0.05, 0.07 and 0.09). Each algorithm 
was run to generate a set of association rules. For lack of space the results are 
reported only for correlation coefficient 0.4 and minimum interest 0.07. For all 
the results, please see [12]. 

Table 7. Results for Reuters text collection
(a) Results for rules of type X → Y
(b) Results for rules of type X → ¬Y
(c) Results for rules of type ¬X → Y
(d) Results for rules of type ¬X → ¬Y

For these association rules a number of measures were computed: support
(supp), confidence (conf), Piatetsky-Shapiro measure (PS), Yule's Q (Q), cosine
measure (IS) and the Jaccard measure (J). These measures evaluate the
interestingness of the discovered patterns. For more details on these measures
for frequent patterns see [13], where a set of measures is compared and
discussed and the measures are clustered with respect to their similarity. We
chose to compute a few measures from different clusters to ensure the diversity
of evaluation.
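
For reference, the six measures can be computed from the 2×2 contingency counts
of a rule X → Y as in the Python sketch below. It uses the usual textbook
definitions (for instance PS = P(XY) − P(X)P(Y)), which are assumed here to
match those of [13]; the function names and the counts are hypothetical.

```python
from math import sqrt


def ratio(numerator, denominator):
    """Division that returns 0.0 instead of failing on an empty denominator."""
    return numerator / denominator if denominator else 0.0


def measures(n11, n10, n01, n00):
    """Measures for X -> Y from counts: n11 = X and Y, n10 = X only,
    n01 = Y only, n00 = neither."""
    n = n11 + n10 + n01 + n00
    p_xy = ratio(n11, n)
    p_x = ratio(n11 + n10, n)
    p_y = ratio(n11 + n01, n)
    return {
        "supp": p_xy,
        "conf": ratio(p_xy, p_x),
        "PS": p_xy - p_x * p_y,
        "Q": ratio(n11 * n00 - n10 * n01, n11 * n00 + n10 * n01),
        "IS": ratio(p_xy, sqrt(p_x * p_y)),
        "J": ratio(p_xy, p_x + p_y - p_xy),
    }


# Hypothetical counts over 100 transactions.
result = measures(n11=40, n10=10, n01=10, n00=40)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

For a negative rule such as X → ¬Y the same function applies once the counts are
permuted so that n11 counts the transactions containing X but not Y.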

Table 7 presents the results obtained for the Reuters dataset. We conducted
the experiments with support 20% and confidence 0%. In each table a subset of
the obtained rules is compared. Table 7 (a) compares rules of the type X → Y,
Table 7 (b) rules of the type X → ¬Y, Table 7 (c) rules of the type ¬X → Y
and Table 7 (d) rules of the type ¬X → ¬Y. In each table the average of
the measurement and the standard deviation are reported. The value in bold
represents the best value for each measure.

Table 7 (a) shows that for positive association rules our approach tends to 
generate a more interesting set of rules compared to the other methods. 

For rules of type X → ¬Y (Table 7 (b)) our approach and SRM perform
best. They produce the same set of rules for a correlation value of 0.4.

Table 7 (c) and Table 7 (d) compare our approach with the one in [4] only,
since the SRM algorithm does not generate these kinds of rules.

In Table 7 (c) the rules symmetric to the ones in Table 7 (b) are generated,
since the confidence is set to 0% and the correlation and minimum interest are
computed for the XY itemset.

However, for the rules of type ¬X → ¬Y (Table 7 (d)) the method in [4]
generates a smaller set of rules, but with higher values for the measures.

6 Conclusions and Future Research Directions 

In this paper we introduced a new algorithm to generate both positive and
negative association rules. Our method adds the correlation coefficient to the
support-confidence framework to generate stronger positive and negative rules.
We compared our algorithm with other existing algorithms on a real dataset. We
discussed their performance on a small example for a better illustration of the
algorithms, and we presented and analyzed experimental results for a text
collection. The results show that our algorithm can discover strong patterns. In
addition, our method generates all types of confined rules, which allows it to
be used in different applications where all these types of rules, or just a
subset of them, are needed.

Abstract. In this paper, we present an experiment on knowledge discovery in
chemical reaction databases. Chemical reactions are the main elements on which
synthesis in organic chemistry relies, and this is why chemical reaction
databases are of primary importance. From a problem-solving process point of
view, synthesis in organic chemistry must be considered at several levels of
abstraction: mainly a strategic level where general synthesis methods are
involved, and a tactic level where actual chemical reactions are applied. The
research work presented in this paper is aimed at discovering general synthesis
methods from chemical reaction databases in order to design generic and reusable
synthesis plans. The knowledge discovery process relies on frequent levelwise
itemset search and association rule extraction, but also on chemical knowledge
involved within every step of the knowledge discovery process. Moreover, the
overall process is supervised by an expert of the domain. The principles of
this original experiment on mining chemical reaction databases and its results
are detailed and discussed.

Keywords: knowledge discovery, data mining, frequent level-wise item- 
set search, association rule, knowledge-based system. 



1 Introduction 

In this paper, we present an experiment on the application of knowledge
discovery algorithms for mining chemical reaction databases. Chemical reactions
are the main elements on which synthesis in organic chemistry relies, and this
is why chemical reaction databases are of primary importance. From a
problem-solving process point of view, synthesis in organic chemistry must be
considered at several levels of abstraction: mainly a strategic level where
general synthesis methods are involved, and a tactic level where actual
chemical reactions are applied. The research work presented in this paper is
aimed at discovering general synthesis methods from chemical reaction databases
in order to design generic and reusable synthesis plans. This can be understood
in the following way: mining reaction databases at the tactic level to find
synthesis methods at the strategic level. This knowledge discovery process
relies on the one hand on mining algorithms, i.e. frequent levelwise itemset
search and association rule extraction, and, on the other hand, on domain
knowledge, which is involved at every step of the knowledge discovery process.

(c) Springer-Verlag Berlin Heidelberg 2004







Sandra Berasaluce et al. 



This research work is carried out within a long-term project for designing
chemical information systems whose goal is to help a chemist build a synthesis
plan [14, 19]. The general problem of synthesis relies on the design of a
synthesis plan followed by experimentation with this synthesis plan. Synthesis
planning is mainly based on an analytical reasoning process, called
retrosynthesis, where the first element of the plan is the target molecule, i.e.
the molecule that has to be built (see fig. 1). This process can be likened to a
goal-directed problem-solving approach: the target molecule is iteratively
transformed by applying reactions to obtain simpler fragments, until starting
materials are found that are easy to build or to obtain (this constitutes a
synthesis pathway). For a given target molecule, a huge number of starting
materials and reactions may exist, e.g. thousands of commercially available
chemical compounds. Thus, exploring all the possible pathways issued from a
target molecule leads to a combinatorial explosion. Therefore the choice of
reaction sequences to be used within the planning process is of primary
importance, and strategies are needed for efficiently solving the synthesis
planning problem.

At present, reaction database management systems are the most useful tools 
for helping the chemist in synthesis planning. Other knowledge systems have 
been developed since the 1970s for helping synthesis planning based on a
retrosynthetic approach. The main problem in this kind of system is the constitution
of the knowledge base. In our research work, we are designing a new kind of 
knowledge system for synthesis planning, combining the principles of knowledge 
systems, database systems, and knowledge discovery [4, 3, 19]. One aspect of this 
research is to study how data mining techniques may contribute to knowledge 
extraction from reaction databases, and beyond that, to the structuring of these 
databases and the improvement in their querying. 

This paper presents a preliminary experiment carried out on two commercial
reaction databases¹ using frequent itemset search and association rule
extraction [2, 16]. This study is original and novel within the domain of
organic synthesis planning, and is of primary importance with respect to
chemical research. Regarding knowledge discovery research, we stress the fact
that knowledge extraction in an application domain has to be guided by domain
knowledge if substantial results are to be obtained. Indeed, the knowledge
extraction process is performed under the supervision of a domain expert, but
the computing process itself has to be guided by domain knowledge at every
step, i.e. cleaning and transforming data, and interpreting results. We claim
that the role of knowledge within the knowledge extraction process is most of
the time underestimated, and one of the goals of this paper is to show that
taking advantage of the functionalities of a knowledge system within the
knowledge discovery process may be of primary importance for obtaining accurate
and realistic results.

The paper is organized as follows. First, we introduce the chemical context,
describing the synthesis problem. Then, we detail the selection and the
preprocessing of data, i.e. organic synthesis reactions from reaction databases,
and then the application of data mining techniques to these data, namely
frequent itemset search and association rule extraction. We show how the results
of such a knowledge discovery process give insights for information organization
and retrieval within reaction databases. Moreover, the extracted knowledge
units, after being validated by a chemist, may be useful in the search for
efficient reactions in association with a synthesis problem. Then we conclude
the paper with a discussion regarding the present research work and the research
perspectives.

¹ Supplied by Molecular Design Ltd - MDL (http://www.mdli.com).



2 The Chemical Context 

2.1 The Synthesis Problem 

The information needs of a chemist solving a synthesis problem are related to a
search in the literature for specific reactions solving synthesis problems
considered to be similar to the current one. There is a huge number of specific
reactions described within articles in the literature, certainly more than 10
million. Reaction documentation is complex and not yet standardized: many
classification systems have been proposed, based on reaction mechanism or
electron properties, but they are not really useful for studying synthesis in
the large. Actually, the main questions for the synthesis chemist are related to
the chemical families to which a target molecule belongs, and to the synthesis
methods, i.e. a reaction or a sequence of reactions building structural
patterns, to be used for building these families. For the sake of simplicity, we
will hereafter use only the term “reaction” to denote either a basic reaction or
a synthesis method.

Two main categories of reactions may be distinguished: reactions building
the skeleton of a molecule (the arrangement of carbon atoms on which an
organic molecule relies), and reactions changing the functionality of a
molecule, i.e. changing a function into another function (see fig. 2). In our
framework, a function is mainly used for recognizing a given molecule as a
member of a chemical family, and for predicting and explaining the reactivity of
the molecule. It is defined as a connected molecular substructure composed of
multiple carbon-carbon bonds, carbon-heteroatom bonds (a heteroatom is an atom
that is not a carbon atom), and heteroatom-heteroatom bonds. Here, we are mainly
interested in reactions changing the functionality, and in the following
questions: (i) what is the starting function Fi for a given formed function Fj?
(ii) what are the reactions allowing the transformation of a function Fi into a
function Fj? (iii) what are the functions Fi remaining unchanged during the
application of a reaction?

2.2 The Reaction Databases: Data Selection and Preprocessing 

The experiment reported hereafter has been carried out on two reaction
databases, namely the “Organic Syntheses” database ORGSYN-2000 including 5486
records, and the “Journal of Synthetic Methods” database JSM-2002 including
75291 records. The selection of these databases relies on size and quality
criteria. In these databases, the filtering of the data related to functional
transformations has been performed within a data preprocessing step, where only
structural information about the reaction has been considered (details are given
in § 3.2).

The purpose of the preprocessing step of data mining is to improve the quality
of the selected data by cleaning and normalizing the data. Reaction databases
such as ORGSYN-2000 and JSM-2002 may be seen as collections of records, where
every record contains one chemical equation involving structural information
that can be read, according to the reaction model, as the transformation of an
initial state (the set of reactants) into a final state (the set of products),
associated with an atom-to-atom mapping between the initial and final states
(see fig. 3).

In our framework, data preprocessing has mainly consisted of exporting and
analyzing the structural information recorded in the databases in order to
extract and represent the functional transformations in a target format that is
processed afterwards. The considered transformations are functional
modifications, and functional addition and deletion, i.e. adding or deleting a
function. Moreover, no distinction has been made between one-step and
multi-step reactions. Errors in the atom-to-atom mapping have been neglected, as
a reaction is considered at an abstract level, the so-called block level, as
explained hereafter. The abstraction of a reaction from the atom level into the
block level is carried out using the RESYN-ASSISTANT system [19, 3] (some details
on RESYN-ASSISTANT are given in § 3.2).

Fig. 3. The structural information on a reaction with the associated atom-to-atom
mapping (reaction #13426 in the JSM-2002 database).



In the following, we discuss the whole process of chemical reaction databases 
manipulation, involving data transformation and data mining, for retrieving and 
organizing chemical reaction databases. 



3 Knowledge Discovery in Reaction Databases 

3.1 An Overview of the Knowledge Discovery Process 

The knowledge discovery process in chemical reaction databases is considered as
an interactive and iterative experimental process. An expert of the data domain,
called hereafter the analyst, plays a central role in this process since he is in
charge of controlling all the steps of the process (as discussed e.g. in [9, 5]).
According to given synthesis objectives, the analyst first selects the data to be
analyzed, applies data mining modules for extracting knowledge units from data,
and finally interprets and validates the units having a sufficient plausibility
for being reused. To carry out this process effectively, the analyst may take
advantage of his own knowledge of the domain (he is an expert) and also of
a set of modules including a knowledge system, ontologies, and molecule and
reaction databases².

Hence, in our approach, the knowledge discovery process is, first of all, guided
by the analyst and domain knowledge. The knowledge discovery process itself is
based on frequent itemset search and association rule extraction. In practice,
the Close and the Pascal algorithms have been used for data processing [16, 15].
Their application and the results that have been obtained are discussed in the
next sections.



3.2 The Modeling of Reactions for Knowledge Discovery 

Data on organic reactions are generally recorded in databases within structural
and textual entries: the former describe the structural formulae of substances
in terms of “molecular graphs” while the latter refer to reaction conditions,
names and roles of the substances involved, bibliographical references, keywords
and comments. In our experiment, we have been mainly interested in the so-called
functionality changes (or interchanges) occurring during a reaction. These
interchanges can be recognized and represented by comparing the functionality of
the reactants with that of the products: the removal of some (old) functions and
the creation of some (new) functions can be made explicit. The comparison relies
on the atom-to-atom mapping, where functionality interchanges correspond to the
substitution of an atom from one function to another.

² More generally, the Web in the large could be taken into account if necessary.

Formally, the representation of a reaction equation relies on the atom-to-atom
mapping relation between the graphs of the reactants and the graphs of the
products, defining three bond sets: the set of broken (destroyed) bonds, the set
of formed bonds, and the set of unchanged bonds. Actually these three function
modifications correspond to subgoals that are achieved during the synthesis,
preparing a main objective [7].

The knowledge system RESYN-ASSISTANT has been designed for assisting the
chemist in the design of organic synthesis problems. In particular, the
RESYN-ASSISTANT system is able to recognize the building blocks of a molecule,
and among these blocks, the functional blocks, or more simply functions. Every
function is defined by a name, and is represented as a graph, called functional
graph, modeling the structure of the function. The set of functional graphs is
partially ordered by a subsumption relation based on a typed subgraph relation.
In this way, the set of functional graphs constitutes a concept hierarchy, called
H_f, that is part of the knowledge base of the RESYN-ASSISTANT system.
Recognizing a function F_k within a molecule, say M, means that the structure of
M includes the functional graph F_k as a subgraph. This recognition process is
based on the classification of M within the function hierarchy H_f. At present,
the H_f hierarchy includes about five hundred named functions.

The RESYN-ASSISTANT system has been extended to recognize the building
blocks of reactions. Based on the atom-to-atom mapping, the system establishes
the correspondence between the recognized blocks of the same nature, and
determines their role in the reaction. The abstraction of the reaction introduced
in figure 3 from the atom-to-atom level into the block level is shown in figure 4.
A function may be present in a reactant, in a product, or in both. In the last
case, the function is unchanged. In the two other cases, the function in the
reactant is destroyed, or the function in the product is formed. During a
reaction, one or more reactant functions may contribute to forming the functions
in the products. At the end of the preprocessing step, the information obtained
by the recognition process is incorporated into the representation of the
reaction.

Fig. 4. The analysis of reaction #13426 in the JSM-2002 database in terms of blocks.
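
The three roles a function can play in a reaction can be sketched in Python with
plain set operations. This is a deliberate simplification: it assumes that
functions are identified by name only and ignores the atom-to-atom mapping that
RESYN-ASSISTANT relies on, and the function name `classify_functions` and the
example reaction are hypothetical.

```python
def classify_functions(reactant_functions, product_functions):
    """Split function names into destroyed, formed and unchanged sets."""
    reactants = set(reactant_functions)
    products = set(product_functions)
    return {
        "destroyed": reactants - products,   # present in the reactants only
        "formed": products - reactants,      # present in the products only
        "unchanged": reactants & products,   # present on both sides
    }


# Hypothetical esterification-like reaction described at the block level.
roles = classify_functions(
    reactant_functions={"alcohol", "anhydride", "phenyl"},
    product_functions={"ester", "carboxylic acid", "phenyl"},
)
for role, functions in roles.items():
    print(role, sorted(functions))
```

A name-based comparison cannot tell apart two occurrences of the same function
at different positions of a molecule, which is one reason why the actual
recognition process works on the mapped graphs.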

To allow the application of the Close and Pascal algorithms for frequent
itemset search, the data on reactions have to be transformed into a Boolean
table. Thus, the representation of a molecule as a composition of functional
blocks cannot be used in a straightforward way. Moreover, a reaction can be
considered from two main points of view, depending on whether the
atom-to-atom mapping is taken into account or not (see fig. 5):

— a global point of view on the functionality interchanges leads to consider a 
single entry R corresponding to an analyzed reaction, to which is associated a 
list of properties, i.e. formed and/or destroyed and/or unchanged functions, 

— a specific point of view on the functionality transformations that is based 
on the consideration of a number of different entries Rk corresponding to the 
different functions being formed, i.e. the atom-to-atom mapping gives the 
explicit correspondence between the blocks that are formed, destroyed, and 
unchanged (see figure 3).

Fig. 5. The original data are prepared for the mining task: the Boolean transformation
of the data can be done by not taking into account the atom mapping, i.e. one single
line in the Boolean table, or by taking into account the atom mapping, i.e. two lines
in the table.



For example, as shown in figure 5 (in association with the reaction introduced
in figure 3), the block correspondence is taken into account implicitly in the
first point of view, and explicitly in the second³. These two points of view on
the analysis of the content of a reaction provide two kinds of Boolean tables.
The rows correspond to the entries related to a single reaction, one row for the
global (or implicit) point of view, and two or more for the specific (or
explicit) point of view. The columns correspond to the three families of
functions, destroyed, formed and unchanged (the same functions are repeated
three times, once per type of column). Two remarks can be made: first, both
correspondences have been used during the experiment, and, second, in both
cases, spatial information on the graph structure of the molecules is lost.

³ From a synthesis point of view, the first mode is suitable for studying the
chemoselectivity of functionality interchanges, and the second mode is more
suitable for comparing the relative reactivities of the studied functions.
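
The two Boolean encodings described above can be sketched in Python as follows.
The record layout is an assumption made for illustration: in particular the
`mapping` field, which lists for each formed function the destroyed functions
it derives from, stands in for the block correspondence given by the
atom-to-atom mapping, and copying the unchanged functions into every specific
row is a choice of this sketch. All names and the example reaction are
hypothetical.

```python
FUNCTIONS = ["alcohol", "anhydride", "carboxylic acid", "ester", "phenyl"]
ROLES = ["destroyed", "formed", "unchanged"]
# The same functions are repeated three times, once per role.
COLUMNS = [(role, function) for role in ROLES for function in FUNCTIONS]


def boolean_row(entry):
    """Encode one entry (role -> set of function names) as a list of Booleans."""
    return [function in entry.get(role, set()) for role, function in COLUMNS]


def global_rows(reaction):
    """Implicit correspondence: a single row per reaction."""
    return [boolean_row(reaction)]


def specific_rows(reaction):
    """Explicit correspondence: one row per formed function."""
    rows = []
    for formed in sorted(reaction["formed"]):
        entry = {
            "formed": {formed},
            "destroyed": reaction["mapping"][formed],
            "unchanged": reaction["unchanged"],
        }
        rows.append(boolean_row(entry))
    return rows


# Hypothetical reaction record.
reaction = {
    "destroyed": {"alcohol", "anhydride"},
    "formed": {"ester", "carboxylic acid"},
    "unchanged": {"phenyl"},
    "mapping": {
        "ester": {"alcohol", "anhydride"},
        "carboxylic acid": {"anhydride"},
    },
}

global_table = global_rows(reaction)
specific_table = specific_rows(reaction)
print(len(COLUMNS), len(global_table), len(specific_table))
```

In both encodings a row only records which functions occur in which role, which
is why the graph structure of the molecules is no longer available to the
mining algorithms.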

3.3 The Search for Itemsets and the Extraction of Association Rules

The Close and the Pascal algorithms have been applied to Boolean tables (built
as indicated just above) for generating first itemsets, i.e. sets of functions
(with an associated support), and then association rules. The extracted
frequent itemsets may be studied from different points of view. Firstly, studying
frequent itemsets of length 2 or 3 enables the analyst to determine basic
relations between functions. For example, searching for a formed function F_f
(subscript f for formed) deriving from a broken function F_d (subscript d for
destroyed) leads to the study of the itemsets F_d ⊓ F_f, where the symbol ⊓
stands for the conjunction of functions. In some cases, a reaction may depend on
functions present in both reactants and products that remain unchanged
(subscript u for unchanged) during the reaction application, leading to the
study of frequent itemsets such as F_f ⊓ F_u ⊓ F_d. This kind of itemset can be
searched for in order to extract a “protection function” supposed to be stable
under given experimental conditions.
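
As an illustration of the search for itemsets such as F_d ⊓ F_f, the following
Python sketch performs a generic levelwise (Apriori-style) search. It is not
the Close or Pascal algorithm used in the experiment; the function name, the
`_d`/`_f`/`_u` suffixes used to encode the role of a function, and the four
rows are assumptions made for illustration.

```python
from itertools import combinations


def frequent_itemsets(rows, min_support):
    """Return {itemset: support} for all itemsets reaching min_support."""
    rows = [frozenset(row) for row in rows]
    n = len(rows)

    def support(itemset):
        return sum(itemset <= row for row in rows) / n

    def keep_frequent(candidates):
        scored = {c: support(c) for c in candidates}
        return {c: s for c, s in scored.items() if s >= min_support}

    level = keep_frequent({frozenset([item]) for row in rows for item in row})
    result = {}
    size = 1
    while level:
        result.update(level)
        size += 1
        # Join step: unions of two frequent itemsets with exactly `size` items.
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == size}
        # Prune step: every subset of size - 1 must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level
                             for s in combinations(c, size - 1))}
        level = keep_frequent(candidates)
    return result


# Hypothetical rows of a Boolean table, written as sets of true columns.
rows = [
    {"alcohol_d", "anhydride_d", "ester_f", "phenyl_u"},
    {"alcohol_d", "acyl chloride_d", "ester_f"},
    {"alcohol_d", "anhydride_d", "ester_f"},
    {"alkene_d", "alcohol_f"},
]
itemsets = frequent_itemsets(rows, min_support=0.5)
for itemset, supp in sorted(itemsets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), supp)
```

With a minimum support of 0.5 the unchanged function `phenyl_u` is not frequent
in this toy table, which shows how the threshold controls which candidate
protection functions reach the analyst.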

The extraction of association rules gives a complementary perspective on
the knowledge extraction process. For example, searching for the most frequent
ways to form a function F_f from a function F_d leads to the study of rules such
as F_f → F_d: indeed, this rule has to be read in a retrosynthetic way, i.e. if
the function F_f is formed then this means that the function F_d is destroyed.
Again, this rule can be generalized in the following way: determining how a
function F_f is formed from two destroyed functions F_d1 and F_d2, knowing say
that the function F_d1 is actually destroyed, leads to the study of association
rules such as F_f ⊓ F_d1 → F_d2. It must be noted that, for the sake of
simplicity, the examples have been kept formal here. Concrete examples can be
found either in [3] or in [4].
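
Rules such as F_f ⊓ F_d1 → F_d2 can be derived from frequent itemsets and their
supports, as in the Python sketch below. The supports are hypothetical values
chosen for illustration, the function name is an assumption, and the sketch
enumerates all rules of every itemset rather than the reduced rule sets that
closed-itemset algorithms can produce.

```python
from itertools import combinations


def association_rules(supports, min_confidence):
    """Return (antecedent, consequent, support, confidence) tuples.

    `supports` maps frozensets to supports and must contain every
    non-empty subset of each itemset it lists.
    """
    rules = []
    for itemset, supp in supports.items():
        for k in range(1, len(itemset)):
            for subset in combinations(sorted(itemset), k):
                antecedent = frozenset(subset)
                conf = supp / supports[antecedent]
                if conf >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, supp, conf))
    return rules


# Hypothetical supports of frequent itemsets over role-tagged functions.
supports = {
    frozenset({"alcohol_d"}): 0.75,
    frozenset({"anhydride_d"}): 0.5,
    frozenset({"ester_f"}): 0.75,
    frozenset({"alcohol_d", "anhydride_d"}): 0.5,
    frozenset({"alcohol_d", "ester_f"}): 0.75,
    frozenset({"anhydride_d", "ester_f"}): 0.5,
    frozenset({"alcohol_d", "anhydride_d", "ester_f"}): 0.5,
}

rules = association_rules(supports, min_confidence=0.6)
confidence_of = {(a, c): conf for a, c, _, conf in rules}

# Retrosynthetic reading: if an ester is formed, is an alcohol destroyed?
print(confidence_of[(frozenset({"ester_f"}), frozenset({"alcohol_d"}))])
# Knowing that an alcohol is destroyed, which second function is destroyed?
print(confidence_of[(frozenset({"ester_f", "alcohol_d"}), frozenset({"anhydride_d"}))])
```

The second rule has a lower confidence than the first because, in these toy
supports, an anhydride is destroyed in only part of the reactions forming an
ester from an alcohol.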

As usual, the number of itemsets and of association rules to be considered 
depends on: 

— the way of considering the block correspondence, either implicit (one entry 
per reaction) or explicit (two or more entries per reaction),

— the minimal value of the support of the itemsets to be considered, 

— the confidence level chosen for considering and interpreting the association 
rules. 

The results obtained by the application of the data mining algorithms are
discussed in the next two sections, first from a chemical point of view, and
then from a knowledge discovery point of view.

4 Chemical Interpretation of the Knowledge Extraction Results

A whole set of results of the application of the data mining process on the
ORGSYN-2000 and JSM-2002 databases is given in [4]. These results show that
both reaction databases share many common points though they differ in terms
of size and data coverage: among the 500 functions included in the H_f
hierarchy, only 170 are retrieved from ORGSYN-2000 while 300 functions are
retrieved from JSM-2002. The same five functions are ranked first in both
databases, with the highest occurrence frequency. However, some significant
differences can be observed: a given function may be much more frequent in the
ORGSYN-2000 database than in the JSM-2002 database, and vice versa. These
differences can be roughly explained by different data selection criteria and
editor motivations for the two databases.

A qualitative and statistical study of the results has shown the following
behaviors. Some functions have a high stability, i.e. they mostly remain
unchanged, and, on the contrary, some other functions are very reactive, i.e.
they are mostly destroyed. All the reactive functions are more present in
reactants than in products, and some functions are more often formed. Some
functions that are among the most widely used functions in organic synthesis are
more often present and destroyed in reactants, e.g. alcohol and carboxylic acid.
For example, among the standard reactions involving functions, it is well known
to chemists that the ester function derives from a combination of two functions,
one of them being mostly an alcohol. The search for a second function relies on
the study of rules such as ester_f ⊓ alcohol_d → F_d. The main functions that
are retrieved are anhydride, carboxylic acid, ester, and acyl chloride. If
the chemist is interested in the unchanged functions, then the analysis of the
rule ester_f ⊓ alcohol_d ⊓ anhydride_d → F_u gives functions such as acetal,
phenyl, alkene, and carboxylic acid.

These first results provide a good overview on the function stability and 
reactivity. They also give partial answers to the questions that have been posed 
in section 2.1. However, some questions remain open, such as the classification of
reactions with respect to a given point of view, e.g. reactivity, stereochemistry, etc.

Working on functionality interchanges within organic synthesis is a complex
problem, and the data mining experiment presented in this paper raises the
question of the selection of the reaction databases. The choice of the
ORGSYN-2000 and the JSM-2002 databases has been guided by their coverage
relevance. The ORGSYN-2000 database provides an electronic version of the entire
series of Organic Syntheses, and offers access to new general synthesis methods.
The principle followed by the editors of Organic Syntheses is particularly
interesting since each synthesis method has been checked by experts in
laboratories. This practice gives the data a high confidence value, because they
have been verified in compound preparations. On the other hand, the JSM-2002
database is a document-based organic reaction database presenting a high
coverage in organic synthesis (from 1975). To be selected and recorded, a
reaction must be novel or have a particular advantage over an existing one. In
addition, the reaction must have a clear experimental method, and must be
repeatable. For these reasons, the ORGSYN-2000 and the JSM-2002 databases appear
to be suitable for a global study of chemical functionality. Both databases
contain fine-grained selected data, and thus the information retrieval process
is necessarily focused. The ORGSYN-2000 and the JSM-2002 databases have proven to
be useful sources for exploring organic synthesis knowledge rather than for
providing exhaustive information about particular reactions.

5 Discussion: Frequent Itemsets, Rules and Chemical Reactions

First of all, it can be mentioned that only a few research works deal with the
application of data mining methods to reaction databases (see for example
[8, 6, 12, 13]). Moreover, these studies have different objectives, and are
mainly concerned with molecular graph manipulation rather than reaction database
mining. Another study on the lattice-based classification of dynamic knowledge
units has been a valuable source of inspiration for the present work [10],
leading to the division of functions into three categories: formed, destroyed,
and unchanged. The work in [10] is more focused on formal concept analysis and
lattice construction than on data mining concerns. A number of topics can be
discussed here regarding the experiment presented in this paper:

— The abstraction of reactions within blocks and the separation in three kinds 
of blocks, namely formed, destroyed, and unchanged blocks. Indeed, this is
one of the most original ideas in this research work, and it is responsible for
the good results that have been obtained. This idea of the separation into three
families may be reused in other contexts involving dynamic data. However, 
the transformation into a Boolean table has led to a loss of information, e.g. 
the connection information on reactions and blocks. This loss of information 
on the connection of the entries introduces a bias in the data mining process, 
that is quite difficult to take into account. 

— Frequent items or association rules are generic elements that can be used 
either to index (and thus organize) reactions or to retrieve reactions. Put
another way, frequent itemsets or extracted association rules may in certain
cases be considered as a kind of meta-data giving meta-information on the
bases that are under study. For example, a question that the chemist wants to
be answered is the following: if A → B is true, and B → C is also true, then it
can be deduced that A → C is also true, meaning that we can have access to
three reactions if needed (or the access to two reactions allows the access to
a third inferred reaction).

— Knowledge is used at every step of the knowledge extraction process, e.g. 
the coupling of the knowledge extraction process with the RESYN-ASSISTANT
system, domain ontologies such as the function ontologies, the role of the
analyst, etc. Indeed, this is one of the major lessons of this experiment: the
knowledge discovery process in a specific domain such as organic synthesis
has to be knowledge-intensive, and has to be guided by domain knowledge,
and by an analyst as well, to obtain substantial results.

— The role of the analyst includes fixing the thresholds and interpreting the
results. The thresholds must be chosen according to the objectives of the
analyst and to the content of the databases. A threshold of 1% for an item
support means that, for a thousand reactions, ten reactions may form a family:
this is not a bad hypothesis. Moreover, if ten thousand reactions are
considered, then 1% means that a hundred reactions may form a family, and in
this case, this is a very realistic hypothesis. This shows that the thresholds
are very closely linked to the knowledge of the domain. Here, we see again the
influence of the domain: the value of a threshold here is very different from
the values that can be used for a threshold in marketing analysis.

Another remark can be made on what could be called “exceptions”, i.e. a
reaction that appears only once in a database; this means that there is no
other reaction of the same kind in the database, or that the item associated
with this reaction is a unique one. The notion of exception has no substantial
meaning here: a reaction that is unique in one database may be found in
several examples in another database. A unique exemplar is rather a matter
of the point of view taken by the editors of the considered database.

— Other research directions have to be investigated, namely sequential patterns
[1, 17], or working with closed itemsets and icebergs [18]. Regarding closed
itemsets and icebergs, it must be noticed that closed itemsets are the longest
itemsets, and those that potentially bring the most information for the
current mining problem. Thus, it could be interesting to consider only these
closed itemsets, and to work more in the spirit of formal concept analysis,
where a concept lattice based on closed sets of properties is built [11].
Moreover, the use of data mining methods such as frequent itemset search
or association rule extraction has proven to be useful, and has provided
encouraging results. It could be interesting to test other (symbolic) data
mining methods, e.g. OLAP technology, relational mining, cluster analysis,
or Bayesian network classification, knowing that numerical methods such as
hidden Markov models or neural networks are not really adapted to the kind
of data that are considered in our experiment.
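The role of the support threshold discussed above can be illustrated with a small sketch. It is not the Close or Pascal implementation used in the experiment: the reactions, the item names of the form "function:role", and the helper names `support`, `frequent_itemsets` and `confidence` are all hypothetical, and the search is a plain levelwise enumeration on a toy Boolean table.

```python
from itertools import combinations

# Hypothetical toy data: each reaction is a set of "function:role" items,
# i.e. one row of the Boolean table with only the true columns listed.
REACTIONS = [
    {"alcohol:destroyed", "ketone:formed", "aryl:unchanged"},
    {"alcohol:destroyed", "ketone:formed"},
    {"alcohol:destroyed", "aldehyde:formed", "aryl:unchanged"},
    {"alkene:destroyed", "epoxide:formed", "aryl:unchanged"},
]


def support(itemset, rows):
    """Fraction of rows that contain every item of the itemset."""
    return sum(1 for row in rows if itemset <= row) / len(rows)


def frequent_itemsets(rows, min_support):
    """Levelwise search: size-(k+1) candidates are unions of frequent size-k sets."""
    items = sorted(set().union(*rows))
    level = [frozenset([i]) for i in items]
    level = [s for s in level if support(s, rows) >= min_support]
    found = {}
    while level:
        for s in level:
            found[s] = support(s, rows)
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == len(a) + 1}
        level = [c for c in candidates if support(c, rows) >= min_support]
    return found


def confidence(antecedent, consequent, rows):
    """Confidence of the rule antecedent -> consequent."""
    return support(antecedent | consequent, rows) / support(antecedent, rows)


freq = frequent_itemsets(REACTIONS, min_support=0.5)
for itemset, supp in sorted(freq.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), supp)
```

Lowering `min_support` lets rarer families of reactions appear, which is why the threshold has to be chosen with the size and content of the database in mind.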

6 Conclusion 

In this paper, we have presented an experiment on knowledge discovery in
chemical reaction databases. Two databases have been studied and mined in depth,
namely the ORGSYN-2000 and the JSM-2002 databases, using levelwise frequent
itemset search and association rule extraction. The main topic of interest in the
reactions is related to functionality interchanges. Thus, the reactions in the
databases have been abstracted in terms of three kinds of building blocks for
molecules involved in the reactions, namely formed, destroyed and unchanged
blocks. This categorization of blocks has been the basis for building the Boolean
tables on which data mining algorithms such as Close and Pascal have been
applied. From a chemical point of view, the results are very encouraging, and
provide a set of meta-data for organizing and retrieving chemical reactions
according to given synthesis objectives. From a knowledge discovery point of view,
a number of questions can be discussed, such as the value of the thresholds
(usually lower than in marketing analysis), the processing of the data, and
the interpretation of the results. Moreover, two major elements have to be
pointed out, and can be reused in other contexts: the categorization of
dynamic data such as reactions into three families (here, formed, destroyed and
unchanged functions), and the use of knowledge at every stage of the knowledge
discovery process. Indeed, in a domain such as organic synthesis, the knowledge
discovery process has to be fully guided by domain knowledge, and by the analyst,
an expert of the domain, as well. There are a number of research perspectives
following the present work, including the adaptation of sequential pattern
algorithms to chemical reactions, actually taking into account the structures of
the molecules involved in reactions, and working in the spirit of concept analysis
for lattice-based classification of the data.

Abstract. The more sophisticated fuzzy clustering algorithms, like the
Gustafson-Kessel algorithm [11] and the fuzzy maximum likelihood es- 
timation (FMLE) algorithm [10] offer the possibility of inducing clusters 
of ellipsoidal shape and different sizes. The same holds for the EM al- 
gorithm for a mixture of Gaussians. However, these additional degrees 
of freedom often reduce the robustness of the algorithm, thus sometimes 
rendering their application problematic. In this paper we suggest shape 
and size regularization methods that handle this problem effectively. 



1 Introduction 

Prototype-based clustering methods, like fuzzy clustering [1,2,12], expectation
maximization (EM) [6] of a mixture of Gaussians [9], or learning vector
quantization [15,16], often employ a distance function to measure the similarity of
two data points. If this distance function is the Euclidean distance, all clusters
are (hyper-)spherical. However, more sophisticated approaches rely on a
cluster-specific Mahalanobis distance, making it possible to find clusters of
(hyper-)ellipsoidal shape. In addition, they relax the restriction (as it is present,
e.g., in the fuzzy c-means algorithm) that all clusters have the same size [13].
Unfortunately, these additional degrees of freedom often reduce the robustness of
the clustering algorithm, thus sometimes rendering their application problematic.

In this paper we consider how shape and size parameters of a cluster can 
be regularized, that is, modified in such a way that extreme cases are ruled out 
and/or a bias against extreme cases is introduced, which effectively improves ro- 
bustness. The basic idea of shape regularization is the same as that of Tikhonov 
regularization for linear optimization problems [18, 8], while size and weight reg- 
ularization is based on a bias towards equality as it is well-known from Laplace 
correction or Bayesian approaches to the estimation of probabilities. 

This paper is organized as follows: in Sections 2 and 3 we briefly review some 
basics of mixture models and the expectation maximization algorithm as well as 
fuzzy clustering. In Section 4 we discuss our shape, size, and weight regularization 
schemes. In Section 5 we present experimental results on well-known data sets 
and finally, in Section 6, we draw conclusions from our discussion. 



J.-F. Boulicaut et al. (Eds.): PKDD 2004, LNAI 3202, pp. 52-62, 2004. 
(c) Springer-Verlag Berlin Heidelberg 2004

2 Mixture Models and the EM Algorithm

In a mixture model [9] it is assumed that a given data set $X = \{x_j \mid j = 1, \ldots, n\}$
has been drawn from a population of $c$ clusters. Each cluster is characterized
by a probability distribution, specified as a prior probability and a conditional
probability density function (cpdf). The data generation process may then be
imagined as follows: first a cluster $i$, $i \in \{1, \ldots, c\}$, is chosen for a datum,
indicating the cpdf to be used, and then the datum is sampled from this cpdf.
Consequently the probability of a data point $x$ can be computed as

$$f_X(x; \Theta) = \sum_{i=1}^{c} p_C(i; \Theta_i) \cdot f_{X|C}(x \mid i; \Theta_i),$$

where $C$ is a random variable describing the cluster $i$ chosen in the first step,
$X$ is a random vector describing the attribute values of the data point, and
$\Theta = \{\Theta_1, \ldots, \Theta_c\}$ with each $\Theta_i$ containing the parameters for one cluster (that
is, its prior probability $\theta_i = p_C(i; \Theta_i)$ and the parameters of the cpdf).

Assuming that the data points are drawn independently from the same
distribution (i.e., that the probability distributions of their underlying random
vectors $X_j$ are identical), we can compute the probability of a data set $X$ as

$$P(X; \Theta) = \prod_{j=1}^{n} \sum_{i=1}^{c} p_{C_j}(i; \Theta_i) \cdot f_{X_j|C_j}(x_j \mid i; \Theta_i).$$


Note, however, that we do not know which value the random variable $C_j$, which
indicates the cluster, has for each example case $x_j$. Fortunately, though, given
the data point, we can compute the posterior probability that a data point $x$
has been sampled from the cpdf of the $i$-th cluster using Bayes' rule as

$$p_{C|X}(i \mid x; \Theta) = \frac{p_C(i; \Theta_i) \cdot f_{X|C}(x \mid i; \Theta_i)}{f_X(x; \Theta)} = \frac{p_C(i; \Theta_i) \cdot f_{X|C}(x \mid i; \Theta_i)}{\sum_{k=1}^{c} p_C(k; \Theta_k) \cdot f_{X|C}(x \mid k; \Theta_k)}.$$

This posterior probability may be used to complete the data set w.r.t. the cluster,
namely by splitting each datum $x_j$ into $c$ data points, one for each cluster, which
are weighted with the posterior probability $p_{C|X_j}(i \mid x_j; \Theta)$. This idea is used in
the well-known expectation maximization (EM) algorithm [6], which consists in
alternately computing these posterior probabilities and estimating the cluster
parameters from the completed data set by maximum likelihood estimation.
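As a minimal sketch of the posterior computation (the expectation step), the following assumes a one-dimensional Gaussian mixture; the function names `normal_pdf` and `posteriors` and the example parameters are hypothetical and chosen only for illustration.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a one-dimensional normal distribution."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def posteriors(x, priors, means, variances):
    """Posterior probability of each cluster for the data point x (Bayes' rule)."""
    joint = [p * normal_pdf(x, m, v) for p, m, v in zip(priors, means, variances)]
    total = sum(joint)  # mixture density of x, the denominator of Bayes' rule
    return [j / total for j in joint]


# Two equally likely unit-variance clusters centered at -1 and +1.
post_mid = posteriors(0.0, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
post_left = posteriors(-1.0, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
print(post_mid, post_left)
```

A point halfway between the two centers is split evenly, while a point at the left center is attributed mostly to the left cluster.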

For clustering numeric data it is usually assumed that the cpdf of each cluster
is an $m$-variate normal distribution (Gaussian mixture [9,3]), i.e.

$$f_{X|C}(x \mid i; \Theta_i) = \frac{1}{\sqrt{(2\pi)^m |\Sigma_i|}} \exp\left(-\frac{1}{2} (x - \mu_i)^\top \Sigma_i^{-1} (x - \mu_i)\right),$$

where $\mu_i$ is the mean vector and $\Sigma_i$ the covariance matrix of the normal
distribution, $i = 1, \ldots, c$, and $m$ is the number of dimensions of the data space.
In this case the maximum likelihood estimation formulae are

$$\theta_i = \frac{1}{n} \sum_{j=1}^{n} p_{C|X_j}(i \mid x_j; \Theta) \qquad \text{and} \qquad \mu_i = \frac{\sum_{j=1}^{n} p_{C|X_j}(i \mid x_j; \Theta) \cdot x_j}{\sum_{j=1}^{n} p_{C|X_j}(i \mid x_j; \Theta)}$$

for the prior probability $\theta_i$ and the mean vector $\mu_i$, and

$$\Sigma_i = \frac{\sum_{j=1}^{n} p_{C|X_j}(i \mid x_j; \Theta) \cdot (x_j - \mu_i)(x_j - \mu_i)^\top}{\sum_{j=1}^{n} p_{C|X_j}(i \mid x_j; \Theta)}$$

for the covariance matrix $\Sigma_i$ of the $i$-th cluster, $i = 1, \ldots, c$.
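The maximum likelihood estimation step can be sketched as follows for the one-dimensional case, where the covariance matrix reduces to a variance. The function name `m_step` and the layout `post[i][j]` (posterior of cluster i for datum j) are assumptions made for this illustration only.

```python
def m_step(data, post):
    """Re-estimate prior, mean and variance of each cluster from posteriors.

    post[i][j] is the posterior probability of cluster i for datum data[j].
    """
    n = len(data)
    priors, means, variances = [], [], []
    for p_i in post:
        weight = sum(p_i)  # sum of the posteriors of this cluster
        mean = sum(p * x for p, x in zip(p_i, data)) / weight
        var = sum(p * (x - mean) ** 2 for p, x in zip(p_i, data)) / weight
        priors.append(weight / n)
        means.append(mean)
        variances.append(var)
    return priors, means, variances


# With crisp (0/1) posteriors the estimates reduce to per-group statistics.
priors, means, variances = m_step([0.0, 2.0, 10.0, 12.0],
                                  [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]])
print(priors, means, variances)
```

Alternating this step with the posterior computation gives one possible reading of the EM iteration described in this section.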



3 Fuzzy Clustering 

While most classical clustering algorithms assign each datum to exactly one
cluster, thus forming a crisp partition of the given data, fuzzy clustering allows
for degrees of membership, to which a datum belongs to different clusters [1, 2, 12].
Most fuzzy clustering algorithms are objective function based: they determine
an optimal (fuzzy) partition of a given data set $X = \{x_j \mid j = 1, \ldots, n\}$ into
$c$ clusters by minimizing an objective function

$$J(X, U, C) = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{w} d_{ij}^{2}$$

subject to the constraints

$$\sum_{j=1}^{n} u_{ij} > 0, \quad \text{for all } i \in \{1, \ldots, c\}, \qquad (1)$$

and

$$\sum_{i=1}^{c} u_{ij} = 1, \quad \text{for all } j \in \{1, \ldots, n\}, \qquad (2)$$

where $u_{ij} \in [0, 1]$ is the membership degree of datum $x_j$ to cluster $i$ and $d_{ij}$ is the
distance between datum $x_j$ and cluster $i$. The $c \times n$ matrix $U = (u_{ij})$ is called
the fuzzy partition matrix and $C$ describes the set of clusters by stating location
parameters (i.e. the cluster center) and maybe size and shape parameters for each
cluster. The parameter $w$, $w > 1$, is called the fuzzifier or weighting exponent.
It determines the “fuzziness” of the classification: with higher values for $w$ the
boundaries between the clusters become softer, with lower values they get harder.
Usually $w = 2$ is chosen. Hard clustering results in the limit for $w \to 1$. However,
a hard assignment may also be determined from a fuzzy result by assigning each
data point to the cluster to which it has the highest degree of membership.

Constraint (1) guarantees that no cluster is empty and constraint (2) ensures
that each datum has the same total influence by requiring that the membership
degrees of a datum must add up to 1. Because of the second constraint
this approach is usually called probabilistic fuzzy clustering, because with it the
membership degrees for a datum formally resemble the probabilities of its being
a member of the corresponding clusters. The partitioning property of a
probabilistic clustering algorithm, which “distributes” the weight of a datum to the
different clusters, is due to this constraint.

Unfortunately, the objective function $J$ cannot be minimized directly.
Therefore an iterative algorithm is used, which alternately optimizes the membership
degrees and the cluster parameters [1,2,12]. That is, first the membership
degrees are optimized for fixed cluster parameters, then the cluster parameters are
optimized for fixed membership degrees. The main advantage of this scheme is
that in each of the two steps the optimum can be computed directly. By iterating
the two steps the joint optimum is approached (although, of course, it cannot
be guaranteed that the global optimum will be reached; the algorithm may get
stuck in a local minimum of the objective function $J$).

The update formulae are derived by simply setting the derivative of the
objective function $J$ w.r.t. the parameters to optimize equal to zero (necessary
condition for a minimum). Independent of the chosen distance measure we thus
obtain the following update formula for the membership degrees [12]:

$$u_{ij} = \frac{d_{ij}^{-\frac{2}{w-1}}}{\sum_{k=1}^{c} d_{kj}^{-\frac{2}{w-1}}}, \qquad (3)$$

that is, for the usual choice $w = 2$, the membership degrees represent the relative
inverse squared distances of a data point to the different cluster centers, which is
a very intuitive result.
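A minimal sketch of the membership update, assuming the usual form in which the membership of a datum to cluster i is proportional to $d_{ij}^{-2/(w-1)}$ and assuming all distances are strictly positive. The function name `memberships` is hypothetical.

```python
def memberships(sq_dists, w=2.0):
    """Membership degrees of one datum, given its squared distances to all clusters.

    Assumes every squared distance is > 0 (a datum sitting exactly on a
    cluster center would need special handling).
    """
    # d^(-2/(w-1)) expressed through the squared distance d^2
    inv = [d2 ** (-1.0 / (w - 1.0)) for d2 in sq_dists]
    total = sum(inv)
    return [v / total for v in inv]


# Squared distances 1 and 4 give relative inverse squared distances 1 : 0.25.
u = memberships([1.0, 4.0])
print(u)
```

With w = 2 the closer cluster receives membership 0.8 and the farther one 0.2, and the degrees add up to 1 as required by constraint (2).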

The update formulae for the cluster parameters, however, depend on what
parameters are used to describe a cluster (location, shape, size) and on the
chosen distance measure. Therefore a general update formula cannot be given.
Here we briefly review the three most common cases. The best-known fuzzy
clustering algorithm is the fuzzy c-means algorithm, which is a straightforward
generalization of the classical crisp c-means algorithm. It uses only cluster centers
for the cluster prototypes and relies on the Euclidean distance, i.e.,

$$d_{ij}^2 = (x_j - \mu_i)^\top (x_j - \mu_i),$$

where $\mu_i$ is the center of the $i$-th cluster. Consequently it is restricted to finding
spherical clusters of equal size. The resulting update rule is

$$\mu_i = \frac{\sum_{j=1}^{n} u_{ij}^{w} x_j}{\sum_{j=1}^{n} u_{ij}^{w}}, \qquad (4)$$

that is, the new cluster center is the weighted mean of the data points assigned
to it, which is again a very intuitive result.
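The center update can be sketched as a weighted mean in which each datum is weighted by its membership degree raised to the fuzzifier. The function name `update_center` and the tuple representation of points are assumptions for this example.

```python
def update_center(points, u_i, w=2.0):
    """New center of one cluster: mean of the points weighted by u_ij ** w."""
    weights = [u ** w for u in u_i]
    total = sum(weights)
    dim = len(points[0])
    return [sum(wt * p[k] for wt, p in zip(weights, points)) / total
            for k in range(dim)]


# Two points with memberships 1.0 and 0.5 (weights 1.0 and 0.25 for w = 2).
center = update_center([(0.0, 0.0), (2.0, 2.0)], [1.0, 0.5])
print(center)
```

The resulting center lies closer to the point with the higher membership degree.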

The Gustafson-Kessel algorithm [11] uses the Mahalanobis distance, i.e.,

$$d_{ij}^2 = (x_j - \mu_i)^\top \Sigma_i^{-1} (x_j - \mu_i),$$

where $\mu_i$ is the cluster center and $\Sigma_i$ is a cluster-specific covariance matrix with
determinant 1 that describes the shape of the cluster, thus allowing for ellipsoidal
clusters of equal size. This distance function leads to the same update rule (4) for
the cluster centers. The covariance matrices are updated according to

$$\Sigma_i = \frac{\Sigma_i^*}{\sqrt[m]{|\Sigma_i^*|}} \qquad \text{where} \qquad \Sigma_i^* = \frac{\sum_{j=1}^{n} u_{ij}^{w} (x_j - \mu_i)(x_j - \mu_i)^\top}{\sum_{j=1}^{n} u_{ij}^{w}} \qquad (5)$$

and $m$ is the number of dimensions of the data space. $\Sigma_i^*$ is called the fuzzy
covariance matrix, which is simply normalized to determinant 1 to meet the
abovementioned constraint. Compared to standard statistical estimation
procedures, this is also a very intuitive result. It should be noted that the restriction
to clusters of equal size may be relaxed by simply allowing general covariance
matrices. However, depending on the characteristics of the data, this additional
degree of freedom can deteriorate the robustness of the algorithm.
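A sketch of the covariance update for two-dimensional data follows. It assumes the fuzzy covariance matrix is the membership-weighted scatter matrix divided by the sum of the weights, and then rescales it to determinant 1. The function names and the nested-list matrix representation are hypothetical.

```python
def fuzzy_covariance(points, u_i, center, w=2.0):
    """Fuzzy covariance matrix of one cluster for 2-D data (nested lists)."""
    weights = [u ** w for u in u_i]
    total = sum(weights)
    cov = [[0.0, 0.0], [0.0, 0.0]]
    for wt, (x, y) in zip(weights, points):
        d = (x - center[0], y - center[1])
        for a in range(2):
            for b in range(2):
                cov[a][b] += wt * d[a] * d[b] / total
    return cov


def normalize_det(cov):
    """Rescale a 2x2 matrix to determinant 1 (divide by the m-th root, m = 2)."""
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    scale = det ** 0.5
    return [[v / scale for v in row] for row in cov]


pts = [(4.0, 0.0), (-4.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
star = fuzzy_covariance(pts, [1.0, 1.0, 1.0, 1.0], (0.0, 0.0))
shape = normalize_det(star)
print(star, shape)
```

The normalized matrix keeps the axis ratio of the fuzzy covariance matrix but carries no size information, which matches the fixed-size reading of the algorithm.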

Finally, the fuzzy maximum likelihood estimation (FMLE) algorithm [10] is
based on the assumption that the data was sampled from a mixture of $c$
multivariate normal distributions as in the statistical approach of mixture models
(cf. Section 2). It uses a (squared) distance that is inversely proportional to the
probability that a datum was generated by the normal distribution associated
with a cluster and also incorporates the prior probability of the cluster, i.e.,

$$d_{ij}^2 = \frac{\sqrt{(2\pi)^m |\Sigma_i|}}{\theta_i} \exp\left(\frac{1}{2} (x_j - \mu_i)^\top \Sigma_i^{-1} (x_j - \mu_i)\right),$$

where $\theta_i$ is the prior probability of the cluster, $\mu_i$ is the cluster center, $\Sigma_i$
a cluster-specific covariance matrix, which in this case is not required to be
normalized to determinant 1, and $m$ the number of dimensions of the data space
(cf. Section 2). For the FMLE algorithm the update rules are not derived from
the objective function due to technical obstacles, but by comparing it to the
expectation maximization (EM) algorithm for a mixture of normal distributions
(cf. Section 2), which, by analogy, leads to the same update rules for the cluster
center and the cluster-specific covariance matrix as for the Gustafson-Kessel
algorithm [12], that is, equations (4) and (5). The prior probability $\theta_i$ is, in
direct analogy to statistical estimation (cf. Section 2), computed as

$$\theta_i = \frac{1}{n} \sum_{j=1}^{n} u_{ij}^{w}. \qquad (6)$$

Note that the difference to the expectation maximization algorithm consists
in the different ways in which the membership degrees (equation (3)) and the
posterior probabilities in the EM algorithm are computed.

Since the high number of free parameters of the FMLE algorithm renders it 
unstable on certain data sets, it is usually recommended [12] to initialize it with 
a few steps of the very robust fuzzy c-means algorithm. The same holds, though 
to a somewhat lesser degree, for the Gustafson-Kessel algorithm. 

It is worth noting that for both the Gustafson-Kessel and the FMLE
algorithm there exist so-called axes-parallel versions, which restrict the covariance
matrices $\Sigma_i$ to diagonal matrices and thus allow only axes-parallel ellipsoids [14].
These variants have certain advantages w.r.t. robustness and execution time.

4 Regularization

Regularization, as we use this term in this paper, means to modify the parameters 
of a cluster in such a way that certain conditions are satisfied or at least that 
a tendency (of varying strength, as specified by a user) towards satisfying these 
conditions is introduced. In particular we consider regularizing the (ellipsoidal) 
shape, the (relative) size, and the (relative) weight of a cluster. 



4.1 Shape Regularization 

The shape of a cluster is represented by its covariance matrix $\Sigma_i$. Intuitively, $\Sigma_i$
describes a general (hyper-)ellipsoidal shape, which can be obtained, for example,
by computing the Cholesky decomposition or the eigenvalue decomposition of $\Sigma_i$
and mapping the unit (hyper-)sphere with it.

Shape regularization means to modify the covariance matrix, so that a certain
relation of the lengths of the major axes of the represented (hyper-)ellipsoid is
obtained or that at least a tendency towards this relation is introduced. Since
the lengths of the major axes are the roots of the eigenvalues of the covariance
matrix, regularizing it means shifting the eigenvalues of $\Sigma_i$. Note that such a
shift leaves the eigenvectors unchanged, i.e., the orientation of the represented
(hyper-)ellipsoid is preserved. Note also that such a shift of the eigenvalues is
the basis of the Tikhonov regularization of linear optimization problems [18,8],
which inspired our approach. We suggest two methods:

Method 1: The covariance matrices $\Sigma_i$, $i = 1, \ldots, c$, are adapted according to

$$\Sigma_i^{(\mathrm{adap})} = \sigma_i^2 \cdot \frac{S_i + h^2 \mathbf{1}}{\sqrt[m]{|S_i + h^2 \mathbf{1}|}} = \sigma_i^2 \cdot \frac{\Sigma_i + \sigma_i^2 h^2 \mathbf{1}}{\sqrt[m]{|\Sigma_i + \sigma_i^2 h^2 \mathbf{1}|}},$$

where $m$ is the dimension of the data space, $\mathbf{1}$ is a unit matrix, $\sigma_i^2 = \sqrt[m]{|\Sigma_i|}$
is the equivalent isotropic variance (equivalent in the sense that it leads to the
same (hyper-)volume, i.e., $|\Sigma_i| = |\sigma_i^2 \mathbf{1}|$), $S_i = \sigma_i^{-2} \Sigma_i$ is the covariance matrix
scaled to determinant 1, and $h$ is the regularization parameter.

This regularization shifts up all eigenvalues by the value of $\sigma_i^2 h^2$ and then
renormalizes the resulting matrix so that the determinant of the old covariance
matrix is preserved (i.e., the (hyper-)volume is kept constant). This
regularization tends to equalize the lengths of the major axes of the represented
(hyper-)ellipsoid and thus introduces a tendency towards (hyper-)spherical clusters. This
tendency is the stronger, the greater the value of $h$. In the limit, for $h \to \infty$, the
clusters are forced to be exactly spherical; for $h = 0$ the shape is left unchanged.
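The eigenvalue shift with renormalization can be sketched for a 2x2 covariance matrix as follows. The function names `det2` and `regularize_shape` and the nested-list matrix representation are assumptions made for this illustration; the parameter `h2` stands for the squared regularization parameter.

```python
def det2(a):
    """Determinant of a 2x2 matrix given as nested lists."""
    return a[0][0] * a[1][1] - a[0][1] * a[1][0]


def regularize_shape(cov, h2):
    """Shift the eigenvalues of cov by sigma^2 * h2, then restore the determinant."""
    m = 2
    sigma2 = det2(cov) ** (1.0 / m)  # equivalent isotropic variance
    shifted = [[cov[0][0] + sigma2 * h2, cov[0][1]],
               [cov[1][0], cov[1][1] + sigma2 * h2]]
    scale = sigma2 / det2(shifted) ** (1.0 / m)
    return [[scale * v for v in row] for row in shifted]


cov = [[4.0, 0.0], [0.0, 0.25]]        # eigenvalue ratio 16, determinant 1
reg = regularize_shape(cov, h2=1.0)    # eigenvalue ratio drops to 4
print(reg, det2(reg))
```

Adding a multiple of the unit matrix only changes the diagonal, so the eigenvectors (the orientation of the ellipsoid) are untouched, while the determinant is the same before and after.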

Method 2: The above method always changes the length ratios of the major
axes and thus introduces a general tendency towards (hyper-)spherical clusters.
In this method, however, a limit $r$, $r > 1$, for the length ratio of the longest to
the shortest major axis of the (hyper-)ellipsoid is used and only if this limit is
exceeded, the eigenvalues are shifted in such a way that the limit is satisfied.
Formally: let $\lambda_k$, $k = 1, \ldots, m$, be the eigenvalues of the matrix $\Sigma_i$. Set

$$h^2 = \begin{cases} 0, & \text{if } \dfrac{\max_{k=1}^{m} \lambda_k}{\min_{k=1}^{m} \lambda_k} \le r^2, \\[2ex] \dfrac{\max_{k=1}^{m} \lambda_k - r^2 \min_{k=1}^{m} \lambda_k}{\sigma_i^2 (r^2 - 1)}, & \text{otherwise,} \end{cases}$$

and execute Method 1 with this value of $h^2$.
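The choice of the shift for a given axis ratio limit can be sketched for a symmetric 2x2 covariance matrix, whose eigenvalues have a closed form. The sketch assumes that the shift applied to the eigenvalues is the equivalent isotropic variance times `h2`, so that the shifted eigenvalue ratio equals the squared limit; the function names are hypothetical.

```python
import math


def eigenvalues_sym2(a):
    """Eigenvalues (smallest, largest) of a symmetric 2x2 matrix."""
    mean = 0.5 * (a[0][0] + a[1][1])
    root = math.hypot(0.5 * (a[0][0] - a[1][1]), a[0][1])
    return mean - root, mean + root


def h2_for_ratio(cov, r):
    """Squared regularization parameter that limits the axis length ratio to r."""
    lo, hi = eigenvalues_sym2(cov)
    if hi / lo <= r * r:
        return 0.0  # limit not exceeded: leave the shape unchanged
    sigma2 = math.sqrt(lo * hi)  # determinant ** (1/m) with m = 2
    return (hi - r * r * lo) / (sigma2 * (r * r - 1.0))


cov = [[4.0, 0.0], [0.0, 0.25]]  # axis length ratio sqrt(16) = 4
h2 = h2_for_ratio(cov, r=2.0)
lo, hi = eigenvalues_sym2(cov)
ratio_after = (hi + h2) / (lo + h2)  # sigma2 is 1 for this example
print(h2, ratio_after)
```

For this example the eigenvalue ratio drops from 16 to 4, i.e. the axis length ratio drops from 4 to the requested limit 2.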



4.2 Size Regularization 

The size of a cluster can be described in different ways, for example, by the
determinant of its covariance matrix $\Sigma_i$, which is a measure of the cluster's
squared (hyper-)volume, an equivalent isotropic variance $\sigma_i^2$ or an equivalent
isotropic radius (standard deviation) $\sigma_i$ (equivalent in the sense that they lead
to the same (hyper-)volume, see above). The latter two measures are defined as

$$\sigma_i^2 = \sqrt[m]{|\Sigma_i|} \qquad \text{and} \qquad \sigma_i = \sqrt{\sigma_i^2} = \sqrt[2m]{|\Sigma_i|},$$

and thus the (hyper-)volume may also be written as $\sigma_i^m = \sqrt{|\Sigma_i|}$.

Size regularization means to ensure a certain relation between the cluster
sizes or at least to introduce a tendency into this direction. We suggest three
different versions of size regularization, in each of which the measure that is
used to describe the cluster size is specified by an exponent $a$ of the equivalent
isotropic radius $\sigma_i$, with the special cases:

a = 1 : equivalent isotropic radius,
a = 2 : equivalent isotropic variance,
a = m : (hyper-)volume.

Method 1: The equivalent isotropic radii $\sigma_i$ are adapted according to

$$\sigma_i^{(\mathrm{adap})} = \sqrt[a]{s \cdot \frac{\sum_{k=1}^{c} \sigma_k^a}{\sum_{k=1}^{c} (\sigma_k^a + b)} \cdot (\sigma_i^a + b)} = \sqrt[a]{s \cdot \frac{\sum_{k=1}^{c} \sigma_k^a}{cb + \sum_{k=1}^{c} \sigma_k^a} \cdot (\sigma_i^a + b)}.$$

That is, each cluster size is increased by the value of the regularization
parameter $b$ and then the sizes are renormalized so that the sum of the cluster sizes is
preserved. However, the parameter $s$ may be used to scale the sum of the sizes up
or down (by default $s = 1$). For $b \to \infty$ the cluster sizes are equalized completely,
for $b = 0$ only the parameter $s$ has an effect. This method is inspired by Laplace
correction or Bayesian estimation with an uninformative prior (see below).
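The renormalized size adaptation can be sketched as follows. The function name `regularize_sizes` is hypothetical; the arguments mirror the symbols of the text (radii, exponent a, regularization parameter b, scaling parameter s).

```python
def regularize_sizes(radii, a, b, s=1.0):
    """Add b to every size (radius ** a), renormalize the sum, return new radii."""
    sizes = [r ** a for r in radii]
    total = sum(sizes)
    factor = s * total / (len(sizes) * b + total)
    return [(factor * (z + b)) ** (1.0 / a) for z in sizes]


# Radii 1 and 3 with a = 1: the size ratio shrinks from 3 to 2, the sum stays 4.
adapted = regularize_sizes([1.0, 3.0], a=1, b=1.0)
print(adapted)
```

With s = 1 the sum of the sizes (measured with the chosen exponent) is the same before and after the adaptation, while the ratio of the largest to the smallest size moves towards 1.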

Method 2: This method does not renormalize the sizes, so that the size sum
increases by $cb$. However, this missing renormalization may be mitigated to some
degree by specifying a value of the scaling parameter $s$ that is smaller than 1.
The equivalent isotropic radii $\sigma_i$ are adapted according to

$$\sigma_i^{(\mathrm{adap})} = \sqrt[a]{s \cdot (\sigma_i^a + b)}.$$

Method 3: The above methods always change the relation of the cluster sizes
and thus introduce a general tendency towards clusters of equal size. In this
method, however, a limit $r$, $r > 1$, for the size ratio of the largest to the smallest
cluster is used and only if this limit is exceeded, the sizes are changed in such a
way that the limit is satisfied. To achieve this, $b$ is set according to

$$b = \begin{cases} 0, & \text{if } \dfrac{\max_{k=1}^{c} \sigma_k^a}{\min_{k=1}^{c} \sigma_k^a} \le r, \\[2ex] \dfrac{\max_{k=1}^{c} \sigma_k^a - r \min_{k=1}^{c} \sigma_k^a}{r - 1}, & \text{otherwise,} \end{cases}$$

and then Method 1 is executed with this value of $b$.
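The choice of the offset for a given size ratio limit can be sketched as below. The function name `offset_for_ratio` is hypothetical, and the sizes are assumed to be given already raised to the chosen exponent.

```python
def offset_for_ratio(sizes, r):
    """Smallest offset b such that (max + b) / (min + b) does not exceed r."""
    hi, lo = max(sizes), min(sizes)
    if hi / lo <= r:
        return 0.0  # limit not exceeded: sizes stay as they are
    return (hi - r * lo) / (r - 1.0)


b = offset_for_ratio([1.0, 3.0], r=2.0)
print(b, (3.0 + b) / (1.0 + b))
```

Since the subsequent renormalization multiplies all sizes by the same factor, the ratio reached after adding the offset is also the ratio of the final sizes.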



4.3 Weight Regularization 



A cluster weight $\theta_i$ only appears in the mixture model approach and the FMLE
algorithm, where it describes the prior probability of a cluster. For the cluster
weight we may use basically the same regularization methods as for the cluster
size, with the exception of the scaling parameter $s$, since the $\theta_i$ are probabilities,
i.e., we must ensure $\sum_{i=1}^{c} \theta_i = 1$. Therefore we have:

Method 1: The cluster weights $\theta_i$ are adapted according to

$$\theta_i^{(\mathrm{adap})} = \frac{\sum_{k=1}^{c} \theta_k}{\sum_{k=1}^{c} (\theta_k + b)} \cdot (\theta_i + b) = \frac{\sum_{k=1}^{c} \theta_k}{cb + \sum_{k=1}^{c} \theta_k} \cdot (\theta_i + b),$$

where $b$ is the regularization parameter. Note that this method is equivalent to a
Laplace corrected estimation of the prior probabilities or a Bayesian estimation
with an uninformative (uniform) prior.

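Method 2: The value of the regularization parameter $b$ is computed as

$$b = \begin{cases} 0, & \text{if } \dfrac{\max_{k=1}^{c} \theta_k}{\min_{k=1}^{c} \theta_k} \le r, \\[2ex] \dfrac{\max_{k=1}^{c} \theta_k - r \min_{k=1}^{c} \theta_k}{r - 1}, & \text{otherwise,} \end{cases}$$

with a user-specified maximum weight ratio $r$, $r > 1$, and then Method 1 is
executed with this value of $b$.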
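Both weight regularization methods can be sketched together. The function names `regularize_weights` and `weight_offset` are hypothetical, and the example weights are made up for illustration.

```python
def regularize_weights(weights, b):
    """Add b to every weight and renormalize so that the weight sum is preserved."""
    total = sum(weights)
    return [total * (t + b) / (len(weights) * b + total) for t in weights]


def weight_offset(weights, r):
    """Offset b that limits the ratio of the largest to the smallest weight to r."""
    hi, lo = max(weights), min(weights)
    if hi / lo <= r:
        return 0.0
    return (hi - r * lo) / (r - 1.0)


theta = [0.8, 0.2]                 # weight ratio 4
b = weight_offset(theta, r=2.0)    # offset needed for a ratio of at most 2
adapted = regularize_weights(theta, b)
print(b, adapted)
```

The adapted weights still add up to 1, so they can be used directly as prior probabilities in the next iteration.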



5 Experiments 

We implemented our regularization methods as part of an expectation maximiza- 
tion and fuzzy clustering program written by the first author of this paper and 
applied it to several different data sets from the UCI machine learning repository 
[4]. In all data sets each dimension was normalized to mean value 0 and stan- 
dard deviation 1 in order to avoid any distortions that may result from different 
scaling of the coordinate axes.

Fig. 1. Result of Gustafson-Kessel algorithm on the iris data with fixed cluster size
without (left) and with shape regularization (right, method 2 with r = 4). Both images 
show the petal length (horizontal) and the petal width (vertical). Clustering was done 
on all four attributes (sepal length and sepal width in addition to the above). 



As one illustrative example, we present here the result of clustering the iris
data (excluding, of course, the class attribute) with the Gustafson-Kessel
algorithm using three clusters of fixed size (measured as the isotropic radius) of 0.4
(since all dimensions are normalized to mean 0 and standard deviation 1, 0.4 is a
good size of a cluster if three clusters are to be found). The result without shape
regularization is shown in Figure 1 on the left. Due to the few data points located
in a thin diagonal cloud on the right border of the figure, the middle cluster is
drawn into a very long ellipsoid. Although this shape minimizes the objective
function, it may not be a desirable result, because the cluster structure is not 
compact enough. Using shape regularization method 2 with r = 4 the cluster 
structure shown on the right in Figure 1 is obtained. In this result the clusters 
are more compact and resemble the class structure of the data set. 

As another example let us consider the result of clustering the wine data with 
the fuzzy maximum likelihood estimation (FMLE) algorithm using three clusters 
of variable size. We used attributes 7, 10, and 13, which are the most informative 
w.r.t. the class assignments. One result without size regularization is shown in 
Figure 2 on the left. However, the algorithm is much too unstable to present 
a unique result. Often enough clustering fails completely, because one cluster 
collapses to a single data point - an effect that is due to the steepness of the 
Gaussian probability density function. This situation is considerably improved 
with size regularization, a result of which (which sometimes, with a fortunate 
initialization, can also be achieved without) is shown on the right in Figure 2. 
It was obtained with method 3 with r = 2. Although the result is still not 
unique and sometimes clusters still focus on very few data points, the algorithm 
is considerably more stable and reasonable results are obtained much more
often than without regularization. Hence we can conclude that size regularization
considerably improves the robustness of the algorithm.

Fig. 2. Result of fuzzy maximum likelihood estimation (FMLE) algorithm on the
wine data with fixed cluster weight without (left) and with size regularization (right,
method 3 with r = 2). Both images show attribute 7 (horizontal) and attribute 10
(vertical). Clustering was done on attributes 7, 10, and 13.

6 Conclusions 

In this paper we suggested shape and size regularization methods for clustering 
algorithms that use a cluster-specific Mahalanobis distance to describe the shape 
and the size of a cluster. The basic idea is to introduce a tendency towards equal 
length of the major axes of the represented (hyper-)ellipsoid and towards equal 
cluster sizes. As the experiments show, these methods improve the robustness of 
the more sophisticated fuzzy clustering algorithms, which without them suffer 
from instabilities even on fairly simple data sets. Regularized clustering can 
even be used without an initialization by the fuzzy c-means algorithm. It should 
be noted that with a time-dependent shape regularization parameter one may 
obtain a soft transition from the fuzzy c-means algorithm (spherical clusters) to 
the Gustafson-Kessel algorithm (general ellipsoidal clusters). 

Software. A free implementation of the described methods as command line 
programs for expectation maximization and fuzzy clustering can be found at 

http://fuzzy.cs.uni-magdeburg.de/~borgelt/software.html#cluster

Abstract. Three methods for combining multiple clustering systems are
presented and evaluated, focusing on the problem of finding the corre- 
spondence between clusters of different systems. In this work, the clusters 
of individual systems are represented in a common space and their cor- 
respondence estimated by either “clustering clusters” or with Singular 
Value Decomposition. The approaches are evaluated for the task of topic 
discovery on three major corpora and eight different clustering algo- 
rithms and it is shown experimentally that combination schemes almost 
always offer gains compared to single systems, but gains from using a 
combination scheme depend on the underlying clustering systems. 



1 Introduction 

Clustering has an important role in a number of diverse fields, such as genomics 
[1], lexical semantics [2], information retrieval [3] and automatic speech recogni- 
tion [4], to name a few. A number of different clustering approaches have been 
suggested [5] such as agglomerative clustering, mixture densities and graph par- 
titioning. Most clustering methods focus on individual criteria or models and 
do not address issues of combining multiple different systems. The problem of 
combining multiple clustering systems is analogous to the classifier combination
problem, which has received increased attention in recent years [6]. Unlike
the classifier combination problem, though, the correspondence between clusters
of different systems is unknown. For example, consider two clustering systems
applied to nine data points and clustered in three groups. System A's output is
o_A = [1, 1, 2, 3, 2, 2, 1, 3, 3] and system B's output is o_B = [2, 2, 3, 1, 1, 3, 2, 1, 1],
where the i-th element of o is the group to which data point i is assigned.
Although the two systems appear to be making different decisions, they are in fact
very similar. Cluster 1 of system A and cluster 2 of system B are identical,
cluster 2 of system A and cluster 3 of system B share 2 of the 3 points in A's
cluster, and cluster 3 of system A is contained in cluster 1 of system B, which
holds one extra point. If the correspondence problem is solved
then a number of system combination schemes can be applied.
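
The nine-point example above can be checked mechanically. The following is a minimal sketch, not the method proposed in this paper: it counts how often each pair of clusters co-occurs and then tries every one-to-one mapping by brute force, which is only feasible for very small numbers of clusters. The function names `overlap_counts` and `best_mapping` are hypothetical.

```python
from itertools import permutations

o_a = [1, 1, 2, 3, 2, 2, 1, 3, 3]  # output of system A
o_b = [2, 2, 3, 1, 1, 3, 2, 1, 1]  # output of system B


def overlap_counts(a, b, k):
    """counts[i][j] = number of points in cluster i+1 of A and cluster j+1 of B."""
    counts = [[0] * k for _ in range(k)]
    for ca, cb in zip(a, b):
        counts[ca - 1][cb - 1] += 1
    return counts


def best_mapping(counts):
    """Try all k! one-to-one mappings and keep the one with maximum agreement."""
    k = len(counts)
    best = max(
        permutations(range(k)),
        key=lambda p: sum(counts[i][p[i]] for i in range(k)),
    )
    agreement = sum(counts[i][best[i]] for i in range(k))
    return {i + 1: best[i] + 1 for i in range(k)}, agreement


mapping, agreement = best_mapping(overlap_counts(o_a, o_b, 3))
print(mapping, agreement)
```

Under the best mapping the two systems agree on eight of the nine points; only the fifth point is assigned differently.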

Finding the optimum correspondence requires a criterion and a method for
optimization. First, the criterion used here is maximum agreement, i.e. we seek
the correspondence under which clusters of different systems make the maximum
number of identical decisions. Second, we must optimize the selected criterion.
Even if we assume a 0 or 1 correspondence between clusters with only two systems of M



J.-F. Boulicaut et al. (Eds.): PKDD 2004, LNAI 3202, pp. 63-74, 2004. 
(c) Springer-Verlag Berlin Heidelberg 2004







Constantinos Boulis and Mari Ostendorf 



topics each, a brute-force approach would require the evaluation of M! possible
solutions. In this work, three novel methods are presented for determining the
correspondence of clusters and combining them. Two of the three methods are
formulated and solved with linear optimization and the third uses singular value
decomposition.

Another contribution of this work is the empirical result that the combination
schemes are not independent of the underlying clustering systems. Most of the
past work has focused on combining systems generated from a single clustering
algorithm (using resampling or different initial conditions), usually k-means. In
this work, we experimentally show that the relative gains of applying a
combination scheme are not the same across eight different clustering algorithms. For
example, although the mixture of multinomials was one of the worst-performing
clustering algorithms, it is shown that when different runs were combined
it achieved the best performance of all eight clustering algorithms in two out of
three corpora. The results suggest that an algorithm should not be evaluated
solely on the basis of its individual performance, but also on the combination of
multiple runs.

2 Related Work 

Combining multiple clustering systems has recently attracted the interest of 
several researchers in the machine learning community. In [7], three different 
approaches for combining clusters based on graph-partitioning are proposed and 
evaluated. The first approach avoids the correspondence problem by defining a 
pairwise similarity matrix between data points. Each system is represented by a
D x D matrix (D is the total number of observations) where the (i, j) position is
1 if observations i and j belong to the same cluster and 0 otherwise. The
average of all matrices is used as the input to a final similarity-based clustering
algorithm. The core of this idea also appears in [8-12]. A disadvantage of this
approach is that it has quadratic memory and computational requirements. Even
by exploiting the fact that each of the D x D matrices is symmetric and sparse,
this approach is impractical for large D.
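
To make the pairwise-similarity idea concrete, here is a small sketch that builds the D x D same-cluster matrix for two systems and averages them. It illustrates the approach attributed to [7], not code from that work; the function names and the two label vectors are assumptions chosen for the example. The nested loops also make the quadratic cost in D visible.

```python
o_a = [1, 1, 2, 3, 2, 2, 1, 3, 3]  # labels from a first system
o_b = [2, 2, 3, 1, 1, 3, 2, 1, 1]  # labels from a second system


def co_association(labels):
    """D x D matrix with 1 where two observations share a cluster, else 0."""
    d = len(labels)
    return [[1 if labels[i] == labels[j] else 0 for j in range(d)] for i in range(d)]


def average_matrices(matrices):
    """Element-wise mean of equally sized square matrices."""
    n = len(matrices)
    d = len(matrices[0])
    return [[sum(m[i][j] for m in matrices) / n for j in range(d)] for i in range(d)]


similarity = average_matrices([co_association(o_a), co_association(o_b)])
print(similarity[0][1], similarity[2][4], similarity[0][3])
```

An entry of 1.0 means both systems put the pair in the same cluster, 0.5 means they disagree, and 0.0 means both separate the pair. The averaged matrix would then be passed to any similarity-based clustering algorithm.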

The second approach taken in [7] is that of a hypergraph cutting problem.
Each one of the clusters of each system is assumed to be a hyperedge in a
hypergraph. The problem of finding consensus among systems is formulated as
partitioning a hypergraph by cutting a minimum number of hyperedges. This
approach is linear in the number of data points, but requires fairly balanced data
sets and all hyperedges having the same weight. A similar approach is presented
in [13], where each data point is represented with a set of meta-features. Each
meta-feature is the cluster membership for each system, and the data points are
clustered using a mixture model. An advantage of [13] is that it can handle
missing meta-features, i.e. a system failing to cluster some data points. Algorithms
of this type avoid the cluster correspondence problem by clustering the data
points directly.

The third approach presented in [7] is to deal with the cluster correspondence
problem directly. As stated in [7], the objective is to “cluster clusters”,
where each cluster of a system is a hyperedge and the objective is to combine
similar hyperedges. The data points are assigned to the combined hyperedge
they most strongly belong to. Clustering hyperedges is performed by using
graph-partitioning algorithms. The same core idea can also be found in [10, 14-16].
In [10], different clustering solutions are obtained by resampling and are
aligned with the clusters estimated on all the data. In both [14, 15], the different
clustering solutions are obtained by multiple runs of the k-means algorithm with
different initial conditions. An agglomerative pairwise cluster merging scheme is
used, with a heuristic to determine the corresponding clusters. In [16], a
two-stage clustering procedure is proposed. Resampling is used to obtain multiple
solutions of k-means. The output centroids from multiple runs are clustered with
a new k-means run. A disadvantage of [16] is that it requires access to the
original features of the data points, while all other schemes do not. Our work falls in
the third approach, i.e. it attempts to first find a correspondence between clusters
and then combine clusters without requiring the original observations.



3 Finding Cluster Correspondence 

In this paper, three novel methods to address the cluster correspondence problem 
are presented. The first two cast the correspondence problem as an optimization 
problem, and the third method is based on singular value decomposition. 



3.1 Constrained and Unconstrained Search 



We want to find the assignment of clusters to entities (metaclusters) such that
the overall agreement among clusters is maximized. Suppose $R_{\{c,s\}}$ is the $D \times 1$
vector representation of cluster $c$ of system $s$ (with $D$ being the total number
of documents). The $k$-th element of $R_{\{c,s\}}$ is
$p(\text{cluster} = c \mid \text{observation} = k, \text{system} = s)$.
The agreement between clusters $\{c,s\}$ and $\{c',s'\}$ is defined as:

$$g_{\{c,s\},\{c',s'\}} = R_{\{c,s\}}^{T} \cdot R_{\{c',s'\}} \tag{1}$$

In addition, suppose that $\lambda^{\{m\}}_{\{c,s\}} = 1$ if cluster $c$ of system $s$ is assigned to
metacluster $m$ and 0 otherwise, and $r^{\{m\}}_{\{c,s\}}$ is the “reward” of assigning cluster $c$
of system $s$ to metacluster $m$, defined as:

$$r^{\{m\}}_{\{c,s\}} = \frac{1}{|I(m)|} \sum_{\{c',s'\} \in I(m)} g_{\{c,s\},\{c',s'\}}, \qquad \{c',s'\} \in I(m) \Leftrightarrow \lambda^{\{m\}}_{\{c',s'\}} \neq 0 \tag{2}$$

We seek to find the argument that maximizes:

$$\lambda^{*} = \arg\max_{\lambda} \sum_{m=1}^{M} \sum_{s=1}^{S} \sum_{c=1}^{C_s} \lambda^{\{m\}}_{\{c,s\}} \, r^{\{m\}}_{\{c,s\}} \tag{3}$$

subject to the constraints

$$\sum_{m=1}^{M} \lambda^{\{m\}}_{\{c,s\}} = 1, \quad \forall c, s \tag{4}$$

Optionally, we may want to add the following constraint:

$$\sum_{c=1}^{C_s} \lambda^{\{m\}}_{\{c,s\}} = 1, \quad \forall s, m \tag{5}$$



This is a linear optimization problem and efficient techniques exist for
maximizing the objective function. In our implementation, the GNU Linear
Programming library was used (footnote 1). The scheme that results from omitting the
constraints of equation (5) is referred to as unconstrained, while including them
results in the constrained combination scheme. The added constraints ensure that exactly
one cluster from each system is assigned to each metacluster and are useful when
$C_s = C \;\; \forall s$. The entire procedure is iterative, starting from an initial assignment
of clusters to metaclusters and alternating between equations (2) and (3).
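
As a rough illustration of the criterion, the sketch below evaluates the agreement of equation (1), the reward of equation (2) and the objective of equation (3) for two fixed hard assignments of six clusters to three metaclusters. It assumes hard 0/1 cluster posteriors, and it does not implement the iterative linear-programming search itself; the helper names and the two candidate assignments are hypothetical.

```python
o_a = [1, 1, 2, 3, 2, 2, 1, 3, 3]  # hard decisions of system A
o_b = [2, 2, 3, 1, 1, 3, 2, 1, 1]  # hard decisions of system B


def membership_vectors(labels, n_clusters):
    """One D-dimensional 0/1 vector per cluster (hard posteriors)."""
    return [[1.0 if lab == c else 0.0 for lab in labels] for c in range(1, n_clusters + 1)]


def agreement(u, v):
    """Equation (1): inner product of two cluster representations."""
    return sum(a * b for a, b in zip(u, v))


def objective(vectors, assign, n_meta):
    """Equation (3) for a hard assignment: sum of rewards of the chosen metaclusters."""
    total = 0.0
    for m in range(n_meta):
        members = [v for v, a in zip(vectors, assign) if a == m]
        for v in members:
            # Equation (2): mean agreement with every cluster currently in metacluster m.
            total += sum(agreement(v, w) for w in members) / len(members)
    return total


# Cluster order: A1, A2, A3, B1, B2, B3.
vectors = membership_vectors(o_a, 3) + membership_vectors(o_b, 3)
naive = [0, 1, 2, 0, 1, 2]    # pair clusters by their label number
aligned = [0, 1, 2, 2, 0, 1]  # pair A1 with B2, A2 with B3, A3 with B1

print(objective(vectors, naive, 3), objective(vectors, aligned, 3))
```

The aligned assignment scores markedly higher than the naive one, which is the behaviour the maximum-agreement criterion is meant to reward.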

The output of the clustering procedure is matrix $F$ of size $D \times M$, where
each column is the centroid of each metacluster. The $F_m$ column is given by:

$$F_m = \frac{1}{|I(m)|} \sum_{\{c,s\} \in I(m)} R_{\{c,s\}} \tag{6}$$

This can be the final output or a clustering stage can be applied using the
$F$ matrix as the observation representations. Note that the assignments can be
continuous numbers between 0 and 1 (soft decisions) and that the systems do not
need to have the same number of clusters, nor does the final number of metaclusters
need to be the same as the number of clusters. To simplify the experiments, here
we have assumed that the number of clusters is known and equal to the number
of topics, i.e. $C_s = M = \#\text{topics} \;\; \forall s$. The methodology presented here does
not assume access to the original features and therefore it can be applied
irrespective of whether the original features were continuous or discrete.
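
The centroid computation of equation (6) can be sketched as follows, assuming hard 0/1 cluster vectors and a given assignment of clusters to metaclusters. The function names, the example labels and the assignment are illustrative assumptions, not taken from the experiments.

```python
o_a = [1, 1, 2, 3, 2, 2, 1, 3, 3]
o_b = [2, 2, 3, 1, 1, 3, 2, 1, 1]


def membership_vectors(labels, n_clusters):
    """One D-dimensional 0/1 vector per cluster (hard posteriors)."""
    return [[1.0 if lab == c else 0.0 for lab in labels] for c in range(1, n_clusters + 1)]


def metacluster_centroids(vectors, assign, n_meta):
    """Equation (6): column m of F is the mean of the cluster vectors in metacluster m."""
    d = len(vectors[0])
    columns = []
    for m in range(n_meta):
        members = [v for v, a in zip(vectors, assign) if a == m]
        if members:
            columns.append([sum(v[k] for v in members) / len(members) for k in range(d)])
        else:
            columns.append([0.0] * d)  # empty metacluster: all-zero column (a convention assumed here)
    # Return F as D rows of M entries.
    return [[columns[m][k] for m in range(n_meta)] for k in range(d)]


# Cluster order: A1, A2, A3, B1, B2, B3.
vectors = membership_vectors(o_a, 3) + membership_vectors(o_b, 3)
F = metacluster_centroids(vectors, [0, 1, 2, 2, 0, 1], 3)
for row in F:
    print(row)
```

Rows of F with a single entry equal to 1.0 are points on which both systems agree; the fifth point gets 0.5 in two metaclusters, a soft decision that a final clustering stage could resolve.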

The optimization procedure is very similar to any partition-based clustering
procedure trained with the Expectation-Maximization algorithm, like k-means.
In fact, this scheme is “clustering clusters”, i.e. expressing clusters in a
common vector space and grouping them into similar sets. Although the problem is
formulated from the optimization perspective, any clustering methodology can
be applied (statistical, graph-partitioning). However, there are two reasons that
favor the optimization approach. First, it directly links the correspondence
problem to an objective function that can be maximized. Second, it allows us to easily
integrate constraints during clustering such as equation (5). As shown in
section 5, the constrained clustering scheme offers gains over the unconstrained
case, when it is appropriate for the task.



3.2 Singular Value Decomposition Combination 

The third combination approach we introduce is based on Singular Value De- 
composition (SVD). As before, we will assume that all systems have the same 
number of clusters for notational simplicity, though it is not required of the 

1 http://www.gnu.org/software/glpk/glpk.html
algorithm. Just as before, we construct matrix $R$ of size $D \times SC$ ($D$ is the
number of observations, $S$ is the number of systems, $C$ the number of clusters),
where each row contains the cluster posteriors of all systems for a given
observation. $R$ can be approximated as $R \approx U S \Lambda^{T}$, where $U$ is orthogonal and
of size $D \times C$, $S$ (here the matrix of singular values, not the number of
systems) is diagonal and of size $C \times C$, and $\Lambda$ is orthogonal and of size
$(SC) \times C$. The final metaspace is $R \Lambda$ of size $D \times C$. If we define
$p_s(c \mid d) = p(\text{cluster} = c \mid \text{observation} = d, \text{system} = s)$,
$c = 1 \ldots C$, $s = 1 \ldots S$, $d = 1 \ldots D$,
and $h_C(l) = l - C \lfloor l/C \rfloor$ (remainder of division), then the $\phi_{d,c}$ element of $R \Lambda$
is given by:

$$\phi_{d,c} = \sum_{k=1}^{S} \lambda_{g_c(k),c} \; p_k\big(h_C(g_c(k)) \mid d\big) \tag{7}$$

where $g_c(\cdot)$ is a function that aligns clusters of different systems and is
estimated by SVD. In essence, SVD identifies the most correlated clusters, i.e. finds
$g_c(\cdot)$, and combines them with linear interpolation. The $\lambda$ weights provide a soft
alignment of clusters. After SVD, a final clustering is performed using the $\phi_{d,c}$
representation.
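
The projection onto the metaspace can be sketched in plain Python as below. This is an illustration under stated assumptions rather than the implementation used in the experiments: the cluster posteriors are hard 0/1 decisions, and instead of a library SVD routine the leading right singular vectors are approximated by orthogonal (subspace) iteration on R^T R. All function names are hypothetical.

```python
import math

o_a = [1, 1, 2, 3, 2, 2, 1, 3, 3]
o_b = [2, 2, 3, 1, 1, 3, 2, 1, 1]
N_CLUSTERS = 3


def posterior_rows(systems, n_clusters):
    """Build R (D x SC): one row per observation, hard 0/1 posteriors for every system."""
    return [
        [1.0 if labels[d] == c else 0.0 for labels in systems for c in range(1, n_clusters + 1)]
        for d in range(len(systems[0]))
    ]


def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def orthonormalize(vectors):
    """Modified Gram-Schmidt on a list of vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            proj = sum(x * y for x, y in zip(w, b))
            w = [x - proj * y for x, y in zip(w, b)]
        norm = math.sqrt(sum(x * x for x in w))
        basis.append([x / norm for x in w])
    return basis


def leading_right_singular_vectors(r, k, n_iter=300):
    """Approximate the k leading right singular vectors of r by orthogonal iteration."""
    gram = matmul(transpose(r), r)  # R^T R, size SC x SC
    dim = len(gram)
    # Deterministic, linearly independent starting vectors.
    vectors = orthonormalize([[1.0 / (1 + abs(i - j)) for i in range(dim)] for j in range(k)])
    for _ in range(n_iter):
        vectors = orthonormalize(transpose(matmul(gram, transpose(vectors))))
    return transpose(vectors)  # Lambda, size SC x k


R = posterior_rows([o_a, o_b], N_CLUSTERS)
Lambda = leading_right_singular_vectors(R, N_CLUSTERS)
metaspace = matmul(R, Lambda)  # D x C representation used for the final clustering
```

Observations with identical rows in R, such as the first, second and seventh points of this example, receive identical coordinates in the metaspace, so a final clustering stage will keep them together.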



4 Evaluating Clustering Systems 



There is no consensus in the literature on how to evaluate clustering decisions. In
this work, we used two measures to evaluate the clustering output. The first is the
classification accuracy of a one-to-one mapping between clusters and true classes.
The problem of finding the optimum assignment of $M$ clusters to $M$ classes
can be formulated and solved with linear programming. If $r_{i,j}$ is the “reward”
of assigning cluster $i$ to class $j$ (which can be the number of observations on
which they agree), and $\lambda_{i,j} = 1$ if cluster $i$ is assigned to class $j$ and 0
otherwise are the parameters to estimate, then we seek to find
$\max_{\lambda} \sum_{i,j} r_{i,j} \lambda_{i,j}$ under the constraints
$\sum_{i} \lambda_{i,j} = 1$ and $\sum_{j} \lambda_{i,j} = 1$. The constraints ensure a one-to-one mapping.

The second measure we used is the normalized mutual information (NMI)
between clusters and classes, introduced in [7]. The measure does not assume a
fixed cluster-to-class mapping but rather takes the average mutual information
between every pair of cluster and class. It is given by:

$$\mathrm{NMI} = \frac{\sum_{i,j} n_{i,j} \log\left(\frac{n_{i,j} D}{n_i m_j}\right)}{\sqrt{\left(\sum_{i} n_i \log\frac{n_i}{D}\right)\left(\sum_{j} m_j \log\frac{m_j}{D}\right)}} \tag{8}$$

where $n_{i,j}$ is the number of observations on which cluster $i$ and class $j$ agree, $n_i$ is the
number of observations assigned to cluster $i$, $m_j$ the number of observations of
class $j$ and $D$ the total number of observations. It can be shown that $0 \le \mathrm{NMI} \le 1$,
with NMI = 1 corresponding to perfect classification accuracy.
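
Both measures can be sketched in a few lines. The code below is illustrative only: the one-to-one mapping is found by brute force over permutations rather than by linear programming, the NMI follows the definition of [7] in terms of the counts described above, and returning 0.0 when the denominator vanishes is a convention assumed here. The function names are hypothetical.

```python
import math
from collections import Counter
from itertools import permutations


def mapped_accuracy(clusters, classes):
    """Accuracy under the best one-to-one cluster-to-class mapping (brute force)."""
    labels = sorted(set(clusters) | set(classes))
    counts = Counter(zip(clusters, classes))
    best = max(
        sum(counts[(i, j)] for i, j in zip(labels, perm))
        for perm in permutations(labels)
    )
    return best / len(clusters)


def nmi(clusters, classes):
    """Normalized mutual information between a clustering and the true classes."""
    d = len(clusters)
    n_ij = Counter(zip(clusters, classes))
    n_i = Counter(clusters)
    m_j = Counter(classes)
    numerator = sum(n * math.log(n * d / (n_i[i] * m_j[j])) for (i, j), n in n_ij.items())
    denominator = math.sqrt(
        sum(n * math.log(n / d) for n in n_i.values())
        * sum(m * math.log(m / d) for m in m_j.values())
    )
    return numerator / denominator if denominator > 0 else 0.0


true_classes = [1, 1, 2, 3, 2, 2, 1, 3, 3]
predicted = [2, 2, 3, 1, 1, 3, 2, 1, 1]
print(mapped_accuracy(predicted, true_classes), nmi(predicted, true_classes))
```

A clustering that matches the classes up to a relabeling scores 1 on both measures, while a clustering that is statistically independent of the classes has an NMI of 0.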



5 Experiments 

The multiple clustering system combination schemes that are introduced in this
paper are general and can, in principle, be applied to any clustering problem. The
task we have chosen for evaluating our metaclustering schemes is topic discovery,
i.e. clustering documents according to their topic. Topic discovery is an especially
hard clustering problem because of the high dimensionality of the data points
and the redundancy of many features. To simplify our experiments, the number
of topics is assumed to be known. This assumption does not hold in many
practical cases, but standard techniques such as the Bayesian Information Criterion
[17] can be used to select the number of topics. It should be noted that the
unconstrained and SVD combination schemes do not require the same number
of clusters for all systems. On the other hand, the constrained clustering scheme
was proposed based on this assumption.

5.1 Corpora 

The techniques proposed in this work are applied to three main corpora with
different characteristics. The first corpus is 20Newsgroups (footnote 2), a collection of 18828
postings into one of 20 categories (newsgroups). The second corpus is a subset
of Reuters-21578 (footnote 3), consisting of 1000 documents equally distributed among 20
topics. The third corpus is Switchboard-I release 2.0 [18], a collection of 2263
5-minute telephone conversations on 67 possible topics. Switchboard-I and, to a
smaller extent, 20Newsgroups are characterized by a spontaneous, less
structured style. On the other hand, Reuters-21578 contains carefully prepared news
stories for broadcasting. 20Newsgroups and the subset of Reuters are balanced,
i.e. documents are equally divided by topics, but Switchboard-I is not. Also, the
median length of a document varies significantly across corpora (155 words for
20Newsgroups, 80 for the subset of Reuters-21578 and 1328 for Switchboard-I).
Standard processing was applied to all corpora. Words in the default stoplist of
CLUTO (427 words in total) are removed, the remaining words are stemmed and only tokens
with T or more occurrences (T = 5 for 20Newsgroups, T = 2 for Reuters-21578 and
Switchboard-I) are retained. These operations result in 26857 unique tokens and
1.4M total tokens in 20Newsgroups, 4128 unique tokens and 50.5K total tokens
in Reuters, and 11550 unique and 0.4M total tokens in Switchboard.

5.2 Clustering Algorithms 

A number of different clustering systems were used, including the mixture of
multinomials (MixMulti) and the optimization-based clustering algorithms and
criteria described in [19]. The MixMulti algorithm clusters documents by
estimating a mixture of multinomial distributions. The assumption is that each
topic is characterized by a different multinomial distribution, i.e. different counts
of each word given a topic. The probability of a document $d$ is given by:
$p(d) \propto \sum_{c=1}^{M} p(c) \prod_{w \in W_d} p(w \mid c)^{n(w,d)}$, where $M$ is the number of topics, $W_d$ is the set
of unique words that appear in document $d$, $p(w \mid c)$ is the probability of word $w$
given cluster $c$ and $n(w, d)$ is the count of word $w$ in document $d$. The cluster

2 http://www.ai.mit.edu/~jrennie/20Newsgroups/

3 http://www.daviddlewis.com/resources/testcollections/
Table 1. Performance of different combination schemes on various clustering
algorithms for 20Newsgroups.

Algorithm Measure   Single    Best of   SVD       Constr.   Unconstr. No
                    Run       100 runs  Combin.   Combin.   Combin.   Combin.
I1        Accuracy  .422      .412      .418      .417      .408      .459
          NMI       .486      .485      .481      .480      .463      .500
I2        Accuracy  .575      .603      .634      .615      .639      .624
          NMI       .601      .621      .637      .628      .640      .637
E1        Accuracy  .579      .604      .648      .641      .610      .635
          NMI       .588      .606      .639      .631      .628      .633
G1        Accuracy  .535      .561      .581      .562      .578      .576
          NMI       .561      .585      .593      .581      .582      .589
G1'       Accuracy  .576      .608      .642      .630      .563      .644
          NMI       .584      .603      .631      .622      .620      .632
H1        Accuracy  .570      .584      .636      .641      .549      .642
          NMI       .593      .610      .629      .627      .592      .628
H2        Accuracy  .586      .611      .656      .639      .602      .641
          NMI       .598      .616      .646      .634      .628      .638
MixMulti  Accuracy  .534      .620      .679      .677      .621      .651
          NMI       .587      .625      .662      .656      .651      .662

c that each document is generated from is assumed to be hidden. Training such 
a model is carried out using the Expectation-Maximization algorithm [20]. In 
practice, smoothing the multinomial distributions is necessary. The mixture of 
multinomials algorithm is the unsupervised analogue of the Naive Bayes algo- 
rithm and has been successfully used in the past for document clustering [21]. 
Mixture models, in general, have been extensively used for data mining and 
pattern discovery [22]. 
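
The document likelihood and the cluster posteriors used in the E-step can be sketched as follows. This is a toy illustration, not the system used in the experiments: the vocabulary, counts and priors are made up, and additive (add-alpha) smoothing is assumed because the text does not say which smoothing method was used. Computations are done in the log domain to avoid underflow on long documents.

```python
import math


def smoothed_probs(word_counts, vocabulary, alpha=1.0):
    """Add-alpha smoothed multinomial p(w|c) (smoothing method is an assumption)."""
    total = sum(word_counts.get(w, 0) for w in vocabulary) + alpha * len(vocabulary)
    return {w: (word_counts.get(w, 0) + alpha) / total for w in vocabulary}


def cluster_log_terms(doc_counts, priors, word_probs):
    """log of p(c) * prod_w p(w|c)^n(w,d) for every cluster c."""
    return [
        math.log(prior) + sum(n * math.log(probs[w]) for w, n in doc_counts.items())
        for prior, probs in zip(priors, word_probs)
    ]


def posteriors(log_terms):
    """Normalize the per-cluster terms into p(c|d) with the log-sum-exp trick."""
    peak = max(log_terms)
    weights = [math.exp(t - peak) for t in log_terms]
    total = sum(weights)
    return [w / total for w in weights]


vocabulary = ["ball", "game", "vote"]
word_probs = [
    smoothed_probs({"ball": 3, "game": 1}, vocabulary),  # a sports-like cluster
    smoothed_probs({"vote": 4}, vocabulary),             # a politics-like cluster
]
priors = [0.5, 0.5]
document = {"ball": 2, "game": 1}  # n(w, d) for the words in W_d

post = posteriors(cluster_log_terms(document, priors, word_probs))
print(post)
```

In an EM loop these posteriors would weight each document's word counts when re-estimating p(c) and p(w|c) in the M-step.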

The software package CLUTO (footnote 4) was used for the optimization-based
algorithms. Using CLUTO, a number of different clustering methods (hierarchical,
partitional and graph-partitioning) and criteria can be used. For example, the
I2 criterion maximizes the function $\sum_{k} \sqrt{\sum_{u,v \in C_k} \cos(u, v)}$, where $C_k$ is the
set of documents in cluster $k$ and $u, v$ are the tf-idf vector representations of
documents $u, v$ respectively. The I2 criterion attempts to maximize intra-cluster
similarity. Other criteria, like E1, attempt to minimize inter-cluster similarity



4 http://www-users.cs.umn.edu/~karypis/cluto/ 
and yet other criteria, like H2, attempt to optimize a combination of both. For
more information on the optimization criteria and methods, see [19].

Having determined the clustering algorithms to use, the next question is how 
to generate the systems to be combined. We may combine systems from different 
clustering algorithms, pick a single algorithm and generate different systems by 
resampling, or pick a single algorithm and use different initial conditions for each 
system. In this work we chose the last option. 

5.3 Results 

For all results reported in this work the direct clustering method was used for
the CLUTO algorithms. For the single run case, the number reported is the
average of 100 independent runs. For the best of 100 runs case, the number is the
average of 10 runs where each run selects the system with the highest objective
function out of 100 trials. A trial is an execution of a clustering algorithm with a
different initial condition. For the metaclustering schemes, the final clustering is
performed with the default values of CLUTO: 100 runs of the CLUTO algorithm
are performed and the one with the highest objective function is selected.

In Table 1, the performance of the three combination schemes applied to
eight different clustering algorithms on 20Newsgroups is shown. For every
clustering algorithm except I1, we can observe significant gains of the combination
schemes compared to a single run or selecting the system with the highest
objective function. The results show that the SVD combination outperforms the
constrained combination, which in turn outperforms the unconstrained
combination. This suggests that the constraints introduced are meaningful and lead
to improved performance over the unconstrained scheme. Also shown in Table
1 are the results from not using any combination scheme. This means that the
clusters of different systems are not combined but rather the cluster posteriors
for all systems are used as a new document representation. This corresponds to
using matrix R from subsection 3.2 without any dimensionality reduction. This
is the approach taken in [13]. From Table 1, we see that for the MixMulti case
there are gains from using SVD combination rather than using no combination of
clusters at all. For other systems, gains are small or differences are insignificant,
except for I1 again, where accuracy degrades significantly.

In Table 2, the performance of the three combination schemes over the same 
eight algorithms on a 1000-document subset of Reuters-21578 is shown. The 
same trends as in Table 1 seem to hold. Combination appears to offer significant 
improvements for all clustering algorithms, with SVD combination having a lead 
over the other two combination schemes. In most cases, SVD combination is 
better than the best individual clustering system. As in Table 1, the constrained 
scheme is superior to unconstrained but not as good as SVD combination. 

In Table 3 the experiments are repeated for the Switchboard corpus. In con- 
trast to previous tables, the combination schemes do not offer an improvement 
for the CLUTO algorithms and for the unconstrained scheme there is even a 
degradation compared to the single run case. However, the mixture of multino- 
mials records a very big improvement of about 40% on classification accuracy. 
Table 2. Performance of different combination schemes on various clustering
algorithms for a 1000-document subset of Reuters-21578.

Algorithm Measure   Single    Best of   SVD       Constr.   Unconstr. No
                    Run       100 runs  Combin.   Combin.   Combin.   Combin.
I1        Accuracy  .636      .644      .696      .669      .673      .686
          NMI       .697      .697      .735      .711      .725      .726
I2        Accuracy  .709      .797      .838      .838      .764      .808
          NMI       .760      .805      .821      .819      .797      .814
E1        Accuracy  .710      .797      .855      .837      .773      .849
          NMI       .745      .790      .830      .819      .799      .822
G1        Accuracy  .652      .660      .707      .721      .705      .709
          NMI       .699      .716      .723      .727      .723      .727
G1'       Accuracy  .692      .771      .814      .816      .782      .827
          NMI       .730      .771      .797      .800      .790      .804
H1        Accuracy  .709      .822      .844      .834      .789      .835
          NMI       .758      .820      .821      .819      .801      .817
H2        Accuracy  .719      .814      .854      .849      .799      .828
          NMI       .761      .812      .837      .833      .813      .833
MixMulti  Accuracy  .502      .525      .582      .543      .542      .586
          NMI       .597      .609      .658      .644      .633      .651

It is interesting to note that for the Switchboard corpus, although the
mixture of multinomials method was by far the worst clustering algorithm, after
SVD combination it clearly became the best method. The same happened for
the 20Newsgroups corpus, where the mixture of multinomials was among
the worst-performing methods and after SVD combination it became the best.
These results suggest that when developing clustering algorithms, issues of the
performance of metaclustering are distinct from issues of performance of single
systems.



5.4 Factor Analysis of Results 

In this subsection we try to determine the relative importance of two factors in 
the combination schemes: the mean and variance of the classification accuracy 
of individual systems. Comparing Table 1 or 2 with Table 3 the gains in 20News- 
groups or Reuters are higher than Switchboard and the variance of individual 
systems is higher in 20Newsgroups and Reuters than Switchboard. To assess the 
effect of each one of these two factors (mean and variance of individual systems) 
Table 3. Performance of different combination schemes on various clustering
algorithms for Switchboard.

Algorithm Measure   Single    Best of   SVD       Constr.   Unconstr. No
                    Run       100 runs  Combin.   Combin.   Combin.   Combin.
I1        Accuracy  .819      .848      .826      .820      .789      .836
          NMI       .908      .914      .913      .907      .898      .915
I2        Accuracy  .831      .863      .841      .837      .807      .845
          NMI       .913      .920      .920      .918      .910      .922
E1        Accuracy  .798      .819      .819      .777      .736      .818
          NMI       .882      .886      .890      .883      .863      .891
G1        Accuracy  .711      .711      .765      .751      .741      .762
          NMI       .868      .870      .887      .877      .875      .888
G1'       Accuracy  .789      .808      .811      .801      .749      .803
          NMI       .875      .878      .880      .877      .859      .878
H1        Accuracy  .826      .861      .842      .811      .757      .841
          NMI       .910      .918      .918      .899      .895      .918
H2        Accuracy  .814      .845      .840      .817      .773      .830
          NMI       .897      .903      .905      .900      .886      .901
MixMulti  Accuracy  .635      .699      .888      .756      .739      .876
          NMI       .787      .818      .924      .899      .892      .921

we generated 300 systems and chose a set of 100 for metaclustering depending
on high/medium/low variance and similar mean (Table 4) or high/medium/low
mean and similar variance (Table 5). The results of Table 4 do not show a
significant impact of variance on the combination results. The results of Table 5 show
a clear impact of the mean on the combination results. However, from Tables 1,
2 and 3 we know that the performance of the combined system does not depend
simply on the performance of the individual systems: the MixMulti result for
Switchboard compared with the CLUTO results is a counterexample. It appears
that there are unexplained interactions of mean, variance and clustering
algorithm that make the combination more successful in some cases and less
successful in others.

Table 4. Effect of combining sets of 100 systems with approximately the same mean
and different levels of variance. The (stdev,acc) cells contain the standard deviation
and mean of classification accuracy for each set. Systems are generated with the E1
criterion on 20Newsgroups and combined with SVD.

             Low           Medium        High
             Variance      Variance      Variance
(stdev,acc)  (.010, .577)  (.023, .578)  (.056, .580)
Accuracy     .640          .631          .635
NMI          .630          .629          .633

Table 5. Effect of combining sets of 100 systems with approximately the same variance
and different levels of mean. The (stdev,acc) cells contain the standard deviation and
mean of classification accuracy for each set. Systems are generated with the E1 criterion
on 20Newsgroups and combined with SVD.

             Low           Medium        High
             Mean          Mean          Mean
(stdev,acc)  (.018, .538)  (.010, .577)  (.019, .617)
Accuracy     .581          .641          .669
NMI          .616          .632          .647

6 Summary

We have presented three new methods for the combination of multiple clustering
systems and evaluated them on three major corpora and on eight different
clustering algorithms. Identifying the correspondence between clusters of different
systems was achieved by “clustering clusters”, using constrained or unconstrained
clustering or by applying SVD. We have empirically demonstrated that
the combination schemes can offer gains in most cases. Issues of combination of
multiple runs of an algorithm can be important. The combination of different
runs of the mixture of multinomials algorithm was shown to outperform seven
state-of-the-art clustering algorithms on two out of three corpora. In the future we
will attempt to gain a better understanding of the conditions under which poor
individual systems can lead to improved performance when combined.
Abstract. Data reduction is a basic step in a KDD process, useful for
delivering more concise and meaningful data to successive stages. When
mining is applied to data streams, that is, continuous data flows, suitably
reducing them is of great interest, since it enables effective approaches
that require multiple scans of the data, which can then be performed
over one or more reduced sliding windows. A class of queries whose
importance in the context of KDD is widely accepted is that of sum range
queries. In this paper we propose a histogram-based technique for reducing
sliding windows supporting approximate arbitrary (i.e., non-biased) sum
range queries. The histogram, based on a hierarchical structure (as opposed
to the flat structure of traditional ones), is suitable for directly supporting
hierarchical queries and, thus, drill-down and roll-up operations. In
addition, the structure supports sliding window shifting and quick query
answering well (both operations are logarithmic in the sliding window size).
Experimental analysis shows the superiority of our method in terms of
accuracy w.r.t. the state-of-the-art approaches in the context of
histogram-based sliding window reduction techniques.



1 Introduction 

It is well known that data pre-processing techniques (data cleaning and data
reduction), when applied prior to mining, may significantly improve the
overall data mining results. This is particularly true in the context of data stream
mining, where data arrive continuously and mining may be done on the basis
of sliding windows including only the most recent data [1]. Indeed, for the
sliding window itself to be significant, its size, that is, the number of most
recent data items we keep at each instant, should be as large as possible. As a
consequence, any technique capable of reducing (i.e., compressing) sliding windows
by maintaining a good approximate representation of data distribution inside it,

* This work was partially funded by the Italian National Research Council under the
“Reti Internet: efficienza, integrazione e sicurezza” project and by the European
Union under the “SESTANTE - Strumenti Telematici per la Sicurezza e l’Efficienza
Documentale della Catena Logistica di Porti e Interporti” Interreg III-B
Mediterranee Occidentale project



Francesco Buccafurri and Gianluca Lax 



and, at the same time, by smoothing possible outliers, is certainly relevant in
the field of data stream mining. Observe that reducing sliding windows also allows
us to keep more than one approximate sliding window simultaneously,
in order to implement similarity queries and other analyses, like change mining
queries [12], useful for trend analysis and, in general, for understanding the
dynamics of the data stream. In sum, since only limited memory resources are
available in a typical streaming environment [14], reduction is a key factor in
enabling query processing that requires multiple scans of the data. But which
properties must a sliding window reduction technique satisfy? Necessarily, the
reduced sliding window should preserve, to a certain degree, the semantic
nature of the original data, in such a way that meaningful queries for mining activities
can be submitted to the reduced data in place of the original ones. Then, for a given
kind of query, the accuracy of the reduced structure should be sufficiently independent
of the position where the query is applied. Indeed, mining needs the possibility
of freely querying data. In addition, the reduction technique should not overly
limit the capability of drilling down and rolling up data.

In this paper we propose a histogram-based technique for reducing sliding 
windows supporting approximate arbitrary range-sum queries satisfying all the 
above properties. Observe that range-sum queries represents a class of queries 
very frequent in the field of data stream mining. Our histogram, called c-tree, 
differently from traditional ones, is based on a hierarchical structure. Its nodes 
contain, hierarchically, pre-computed range-sum queries, stored by approximate 
(via bit saving) encoding. For this reason, the structure directly supports the 
estimation of arbitrary range-sum queries (indeed, range-sum queries are either 
embedded in the histogram or derivable from the embedded ones by linear
interpolation). Reduction derives both from the aggregation implemented by the leaves of the
tree (discretization), and from the saving of bits obtained by representing range 
queries with less than 32 bits (assumed enough for an exact representation). 
The number of bits used for representing range queries decreases as the level 
of the tree increases. The structure is designed as dynamic, in the sense that 
each update, for maintaining the c-tree on the sliding window, can be applied in 
logarithmic time in the worst case (w.r.t. the window size). Moreover, answering
a range query requires at most logarithmic time too. Observe that the hierarchical
structure directly supports querying at different abstraction levels, thus allowing
drill-down and roll-up operations. Finally, bucket summarization smooths each
data value by consulting the “neighborhood” of values around it. This helps
to remove noise from the data. But the main feature of
our histogram concerns its accuracy. Indeed, for the reduction technique to
be significant, the error should be either guaranteed or heuristically shown to be
low (and this is our case), compared with that of the state-of-the-art techniques.
There is no large literature about the important issue of evaluating approximate
arbitrary range queries on sliding windows. Most of the recent approaches are 
based on histograms [17,16] and Wavelet [15,21]. Other approaches use sam- 
pling [2, 19, 8] and sketches [7, 13]. Histograms are a lossy compression technique 
widely applied in various application contexts, like query optimization, statistical 




Reducing Data Stream Sliding Windows by Cyclic Tree-Like Histograms 






and temporal databases, and OLAP applications. [11] deals with the problem of 
reducing sliding windows by error guaranteed histograms (called exponential
histograms), by solving the problem only in case of biased range queries (i.e., queries
involving the last q data of the sliding window) that are significant queries in 
the context of data stream processing. For arbitrary queries, error may increase 
dramatically, especially if queries are distant from the most recent data of the 
sliding window. The proposal of [6] presents the same characteristics. Unfortu- 
nately, limiting range queries to a particular case is not acceptable in the KDD 
context, as observed earlier. Thus, having a technique not focused on analytic
error guarantee, but experimentally shown to be uniformly accurate w.r.t. ar- 
bitrary range queries is strongly preferable to biased (even error guaranteed) 
methods. 

Validation of our method is conducted experimentally by comparing the c- 
tree with the V-optimal histogram [18] . We have chosen such a histogram since its 
superiority (in terms of accuracy) w.r.t. the state-of-the-art proposals is proven 
in a recent paper [16]. Actually, [16] concerns a more efficient version of
V-optimal (called ε-approximate V-optimal histogram) defined in order to have an
effective method (since V-optimal updating is polylinear). But, in the same
paper, it is shown that the ε-approximate V-optimal histogram is (slightly) less
accurate than the classical V-optimal defined in [18]. This guarantees the
significance of our comparison. In addition, we observe that while our method requires
$O(\log w)$ time for answering a range query (where w is the size of the sliding
window), V-optimal requires $O(w)$ time (assuming that the number of
buckets is linear in the sliding window size). [16] shows that the ε-approximate
approach is also definitely superior to the Wavelet approach [21]. Anyway, we
also perform comparisons with Wavelet histograms [20]. Observe that, in order
to increase significance of our experiments, we have used exact V-optimal [18] 
and Wavelet [20] histograms, that are more accurate than histograms defined 
in [16] and [21], respectively. Indeed, the latter two approaches were introduced 
since both [18] and [20] are not effective in the context of data streams due to 
their high maintenance computational cost. 

The plan of the paper is the following. The c-tree histogram is introduced 
in Section 2 where we describe also the dynamics of the c-tree, that is how 
it is updated while the sliding window is moved. Section 3 shows how the c- 
tree histogram is represented; moreover some considerations about the proposed 
approach are remarked. In Section 4 we show how to use the c-tree for evaluating 
range queries. Section 5 analyzes experimental results validating the method. In 
Section 6 we draw our conclusions. 

2 The c-Tree Histogram 

In this section we describe the core of our proposal, consisting of a tree-like 
histogram, named c-tree, used for managing data streams under sliding windows. 
A sliding window of size w (on a data stream D) is the tuple $(x_w, x_{w-1}, \ldots, x_1)$
containing the w most recent values of D in arrival order (observe that $x_w$
represents the oldest value whereas $x_1$ is the most recent one). For simplicity,
we assume that $w = 2^z$ for a given positive integer $z > 0$. Clearly, at each
new time instant the sliding window is updated by inserting the new value of D
from the right-hand side and deleting the left-most one (corresponding to the oldest
value of the sliding window).

The histogram is built on top of the sliding window, by hierarchically sum- 
marizing the values occurring in it. In order to describe the c-tree we chose a 
constructive fashion. In particular we define the initial configuration (at the time 
instant 0 - coinciding with the origin of the data flow) and we show, at a generic 
instant coinciding with the arrival of a new data, how the c-tree is updated. 
Initial Configuration. The c-tree histogram consists of: 

1. A full binary tree T with n levels, where n is a parameter set according to
the required data reduction (this issue will be treated in Section 5). Each leaf
node N of T is associated with a range (l(N), u(N)) of size $d = w/2^{n-1}$, and the set
of such ranges produces an equi-width partition of the array (1, w). In addition,
we require that adjacent leaves correspond to adjacent ranges of (1, w) and that the
left-most leaf corresponds to the range (1, d). We denote by val(N) the value
of a node N. In the initial state, all nodes of T contain the value 0.

2. A buffer (of size 2) B = (e, s), where $0 \le e < d$ and $s \ge 0$. s represents the
sum of the e most recent elements of the sliding window. Initially, e = s = 0.

3. An index P, with $1 \le P \le 2^{n-1}$, identifying a leaf node of T. P is initially
set to 1, and thus it identifies the left-most leaf of T.

We denote by H the above data structure. Now we describe how H is updated 
when new data arrive. 

State Transition. Let $x_t$ be the data coming at the instant $t > 0$. Then,
$e := (e + 1) \bmod d$ and $s := s + x_t$. Now, if $e \neq 0$ (i.e., the buffer B is not full), then
the updating of H halts. Otherwise (i.e., e = 0), the value s (which summarizes
the last d data) has to be stored in T and, then, the buffer has to be emptied.
We now explain how the insertion of s in T is implemented. Let $\delta = s - val(N_P)$,
where $N_P$ is the leaf of T identified by P. Then, $val(N_P) := val(N_P) + \delta$, and $\delta$ is
also added to all the other nodes belonging to the path from $N_P$ to the root of T. Finally,
e and s of B are reset (i.e., they assume value 0) and $P := (P \bmod 2^{n-1}) + 1$
(this way, leaf nodes of the tree are managed as a cyclic array). Observe that
P points to the leaf node containing the least recent data, and such data are
replaced by the new incoming data. Each update operation requires $O(\log w)$ time,
where w is the size of the sliding window. Now we show an example of building
and updating a 3-level c-tree.

Example 1. Let (35, 51, 40, 118, 132, 21, 15, 16, 18, 29, ...) be the data stream
ordered by increasing arrival time and let the sliding window size be 8; moreover,
let $d = w/2^{n-1} = 2$. Initially, e = 0, s = 0, P = 1 and the value of all nodes
of T is 0. The first data coming from the stream is 35, thus e = 1 and s = 35.
Since $e \neq 0$, no other updating operation has to be done. Then, the data coming
from the stream is 51, thus e = 0 and s = 35 + 51 = 86. Since e = 0, the first
leaf node of T is set to the value s, and all nodes belonging to the path between
such leaf and the root are increased by $\delta = 86$. Finally, P = 2, and e and s are
reset. In Figure 1(a) the resulting c-tree is reported. Therein (as well as in the
other figures of the example), we have omitted buffer values since they are null.
Moreover, right-hand child nodes are represented in grey. This is
because, as we will explain in Section 3, these nodes do not have to be saved since
they can be computed from white nodes. For the first 8 data arrivals, updates
proceed as before. In Figures 1(b) and 1(c) we report just the snapshots after 4
and 8 updates, respectively. The pointer P has now wrapped around, assuming the value
1. Now, the data 18 arrives. Thus, e = 1 and s = 18. At the next time instant,
e = 0 and s = 47. Since e = 0, $\delta = 47 - 86 = -39$ is added to the leaf node
pointed to by P (that is, the first leaf), determining its new value 47. Moreover,
the nodes belonging to the path between such leaf and the root are increased by $\delta$.
At this point, P assumes the value 2. The final c-tree is shown in Figure 1(d).

Fig. 1. The c-tree of Example 1
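The update rule and Example 1 can be replayed with a minimal Python sketch. It assumes exact node values (the bit-saving encoding of Section 3 is ignored) and stores the tree as a heap-ordered array; the class name `CTree` and its method names are illustrative choices and do not come from the paper.

```python
class CTree:
    """Exact-valued c-tree sketch: a full binary tree over 2**(n-1) leaves,
    kept in a heap-ordered array (node 1 is the root, node 0 is unused)."""

    def __init__(self, n, w):
        self.num_leaves = 2 ** (n - 1)
        self.d = w // self.num_leaves            # data items summarized by each leaf
        self.val = [0] * (2 * self.num_leaves)   # all nodes start at 0
        self.e = 0                               # number of items in the buffer
        self.s = 0                               # sum of the buffered items
        self.P = 1                               # 1-based index of the leaf to overwrite next

    def leaf_values(self):
        return self.val[self.num_leaves:]

    def root(self):
        return self.val[1]

    def insert(self, x):
        self.e = (self.e + 1) % self.d
        self.s += x
        if self.e != 0:
            return                               # buffer not full: nothing else to do
        node = self.num_leaves + self.P - 1      # array position of leaf N_P
        delta = self.s - self.val[node]
        while node >= 1:                         # add delta along the path up to the root
            self.val[node] += delta
            node //= 2
        self.s = 0
        self.P = self.P % self.num_leaves + 1    # leaves behave as a cyclic array


tree = CTree(n=3, w=8)
for x in (35, 51, 40, 118, 132, 21, 15, 16, 18, 29):
    tree.insert(x)
print(tree.leaf_values(), tree.root(), tree.P)   # [47, 158, 153, 31] 389 2
```

The heap layout is only a convenience: the parent of node j is node j // 2, so the walk from a leaf to the root touches one node per level, which matches the logarithmic update cost stated above.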



3 c-Tree Representation 

In this section we describe how the c-tree histogram is represented. Besides storing
only the necessary nodes, we use a bit-saving encoding in order to reduce
the storage space. As already sketched in Example 1, each right-hand child node
can be derived as the difference between the value of the parent node and the
value of the sibling node. As a consequence, right-hand child node values do
not have to be stored. In addition, we encode node values through variable-length
representations. In particular: (1) The root is encoded by 32 bits (we assume
that the overall sum of the sliding window data can always be represented by
32 bits with no scaling error). (2) The left child of the root is represented by k bits
(where k is a suitably set parameter; we will discuss this issue in
Section 5). (3) All nodes which belong to the same level are represented by the
same number of bits. (4) Nodes belonging to a level l, with $2 \le l \le n - 1$,
are represented by one bit less than nodes belonging to the level l - 1.

Substantially, the approach is based on the assumption that, on average,
the sum of the occurrences of a given interval of the frequency vector is twice
the sum of the occurrences of each half of such an interval. This assumption is
chosen as a heuristic criterion for designing the c-tree, and this explains the choice
of reducing by 1 per level the number of bits used for representing numbers.
Clearly, the sum contained in a given node is represented as a fraction of the
sum contained in the parent node. Observe that, in principle, one could also use
a representation allowing a different number of bits for nodes belonging to
the same level, depending on the actual values contained in the nodes. However, we
would then have to deal with the spatial overhead due to these variable codes. The reduction
of 1 bit per level appears to be a reasonable compromise. Our approach is validated
by previous results shown in [3, 4] for histograms on persistent data and in [5]
for improving estimation inside histogram buckets.

Remark. We remark that this bit-saving approach is not applicable to non- 
indexed histograms (by a tree). Indeed, for a “flat” histogram, the scaling size 
used for representing numbers would be related to the overall sliding window sum 
value, that is, bucket values would be represented as a fraction of this overall 
sum, with a considerable increase in the scaling error. One could argue that
also data-distribution-driven histograms, like V-Optimal [18], whose accuracy 
has been widely proven in the literature, could be improved by building a tree 
index on top, and by reducing the storage space by trivially applying our
bit-saving approach. However, such indexed histograms induce a non-equi-width
partition and, as a consequence, the reduction of 1 bit per level in the index of
our approach would not be well founded.

Encoding a given node N with a certain number of bits, say i, is done in a
standard fashion. Let us denote by P the parent node of N. The value val(N) of the
node N will not be recovered exactly, in general. It will be affected by a certain
scaling approximation. We denote by $val^i(N)$ the encoding of val(N) done with
i bits and by $\widetilde{val}^i(N)$ the approximation of val(N) obtained from $val^i(N)$.

We have that: $val^i(N) = Round\left(\frac{val(N)}{val(P)} \cdot (2^i - 1)\right)$. Clearly, $0 \le val^i(N) \le 2^i - 1$.

Concerning the approximation of val(N) it results: $\widetilde{val}^i(N) = \frac{val^i(N)}{2^i - 1} \cdot val(P)$.
The absolute error due to the i-bit encoding of the node N, with parent node
P, is: $\epsilon_a(val(N), val(P), i) = |val(N) - \widetilde{val}^i(N)|$. It can be easily verified that:
$0 \le \epsilon_a(val(N), val(P), i) \le \frac{val(P)}{2^{i+1}}$.
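The encoding and its inverse can be illustrated with a small Python sketch. It assumes that Round is ordinary rounding to the nearest integer (Python's built-in `round`, which resolves exact ties to the even neighbour) and that a parent of value 0 forces a code of 0; the function names `encode` and `decode` are hypothetical. Under these assumptions the rounding step is off by at most half a unit, so the absolute error is at most val(P) / (2 · (2^i − 1)), which is close to val(P) / 2^(i+1) for the bit widths considered here.

```python
def encode(val_n, val_p, i):
    """Return the i-bit integer code of val_n, scaled by the parent value val_p."""
    if val_p == 0:
        return 0  # assumption: an empty parent forces an empty child
    return round(val_n / val_p * (2 ** i - 1))


def decode(code, val_p, i):
    """Recover the approximate node value from its i-bit code."""
    return code / (2 ** i - 1) * val_p


# Values taken from the final state of Example 1: the first leaf holds 47
# and its parent holds 47 + 158 = 205. Only 4 bits are used, to make the
# approximation visible.
code = encode(47, 205, 4)       # code == 3
approx = decode(code, 205, 4)   # about 41
error = abs(47 - approx)        # about 6, below 205 / (2 * 15)
```

With more realistic widths (for instance the 9 bits used for leaves in Section 5) the same two functions apply unchanged and the error shrinks accordingly.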

We conclude this section by analyzing both overall scaling error and storage 
space required by the c-tree, once the two input parameters are fixed, that is: n,
i.e., the number of levels, and k, i.e., the number of bits used for encoding the
left-hand child of the root. Concerning scaling error we have to understand how
it is propagated over the path from the root to the leaves of the tree. Indeed, 
the error for a stand-alone node is analyzed above. We may determine an upper 
bound of the worst-case error by considering the sum of the maximum scaling 
error at each stage. Assume that R is the maximum value appearing in the data 
stream and w is the sliding window size. According to considerations above, since 
at the first level we use k bits for encoding numbers, the maximum absolute error
at this level is $\frac{R \cdot w}{2^{k+1}}$. Going down to the second level cannot increase the maximum
error. Indeed, we double the scale granularity (since coding is reduced by 1 bit)
but the maximum allowed value is halved. More precisely, the maximum absolute
error at the second level is $\frac{R \cdot w / 2}{2^{k}} = \frac{R \cdot w}{2^{k+1}}$. Clearly, the same reasoning can be applied to
lower levels, so that the above claim is easily verified. In sum, the maximum
absolute scaling error of the c-tree is $\frac{R \cdot w}{2^{k+1}}$. Interestingly, observe that the error
is independent of the tree depth n.
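The independence from the depth can be checked with a short worked derivation. It is a sketch under two assumptions: the per-node bound is taken to be val(P) / 2^(i+1), and a node at level l (with the root at level 0) is encoded with i = k − l + 1 bits, so that level 1 uses k bits and each further level one bit less. The parent of such a node lies at level l − 1 and summarizes w / 2^(l−1) data, each at most R, hence:

```latex
\[
val(P) \le \frac{R \cdot w}{2^{\,l-1}}, \qquad i = k - l + 1
\quad\Longrightarrow\quad
\epsilon_a \;\le\; \frac{val(P)}{2^{\,i+1}}
\;\le\; \frac{R \cdot w}{2^{\,l-1} \cdot 2^{\,k-l+2}}
\;=\; \frac{R \cdot w}{2^{\,k+1}} .
\]
```

The level index l cancels out, so the same bound holds at every level: each bit dropped doubles the granularity, while the largest value to be represented is halved.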

Concerning the storage space (in bits) required by the c-tree, we have:

$$(n - 1) + \lceil \log(R \cdot d) \rceil + \lceil \log(d) \rceil + 32 + \sum_{h=0}^{n-2} (k - h) \cdot 2^h \qquad (1)$$

where $d = \frac{w}{2^{n-1}}$ and the first three components of the sum take account of P,
s and e, respectively, while $32 + \sum_{h=0}^{n-2} (k - h) \cdot 2^h$ is the space required for the saved
nodes of T (recall that only left child nodes are stored). In Section 5 we will
discuss the setting of the parameters n and k.
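Formula (1) is easy to evaluate numerically. The following Python sketch assumes that the logarithms are in base 2 and uses R = 100, the upper end of the synthetic data range used in Section 5; the function name `ctree_storage_bits` is hypothetical.

```python
from math import ceil, log2


def ctree_storage_bits(n, k, R, w):
    """Storage (in bits) of a c-tree with n levels and k bits for the root's left child."""
    d = w // 2 ** (n - 1)                  # size of a leaf range
    bits_P = n - 1                         # index P over 2**(n-1) leaves
    bits_s = ceil(log2(R * d))             # buffer sum s
    bits_e = ceil(log2(d))                 # buffer counter e
    # 32 bits for the root, then 2**h stored (left-child) nodes of k - h bits each
    bits_nodes = 32 + sum((k - h) * 2 ** h for h in range(n - 1))
    return bits_P + bits_s + bits_e + bits_nodes


budget = 22 * 32                           # 22 four-byte numbers
for w in (64, 128, 256, 512, 1024):
    print(w, ctree_storage_bits(n=7, k=14, R=100, w=w), budget)
```

Under these assumptions, the setting n = 7 and k = 14 chosen in Section 5 stays within the budget of 22 four-byte numbers (704 bits) for all the window sizes considered there.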

4 Evaluation of a Range-Sum Query 

In this section we describe the algorithm used for evaluating the answer to a range
query $Q(t_1, t_2)$, where $0 \le t_1 < t_2 \le w$, that computes the sum of data arrived
between time instants $t - t_1$ and $t - t_2$, where t denotes the current
time instant. For example, if $t_1 = 0$ and $t_2 = 5$ it represents the sum of the 5
most recent data. The c-tree allows us to reduce the storage space required for storing
the data of the sliding window and, at the same time, to give fast yet approximate
answers to range queries. As usual in this context, this approximation is the price
we have to pay for having a small data structure to manage and for obtaining
fast query answering. We now introduce some definitions that will be used in the
algorithm.

Notations: Given a range query $Q(t_1, t_2)$, with $t_1 \ge e$: (1) Let $\eta$ be the set
of leaf nodes containing at least one data involved in the range query. Let
$l_i = \left(P - ceil\left(\frac{t_i - e}{d}\right)\right) \bmod 2^{n-1}$, with i = 1, 2, be the indexes of the two leaf nodes $L_1$ and
$L_2$. $\eta$ consists of all leaf nodes succeeding $L_2$ and preceding $L_1$ (including both
$L_1$ and $L_2$) in the ordering obtained by considering leaf nodes as a cyclic array.
(2) Given a non-leaf node N, let L(N) be the set of leaf nodes descending
from N. (3) Given a leaf node N, we define $I(N, Q) = \frac{i}{d} \cdot val(N)$, where i is the
number of data stored by N involved in Q. I(N, Q) computes the contribution of
a leaf node to a range query (by linear interpolation). (4) Let $\tilde{Q}$ be the estimation
of the range query computed by means of the c-tree.

First, suppose that $t_1 \ge e$ (recall that e is the number of data in the buffer B)
in such a way that the range query doesn’t involve data in the buffer B. The 
algorithm for the range query evaluation is performed by calling the function 
contribution on the root of the c-tree. 
The function contribution is shown below:

function contribution(N)
    if (N is a leaf)
        $\tilde{Q}$ = $\tilde{Q}$ + I(N, Q)
        return $\tilde{Q}$    // the function halts
    else
        for each child $N_x$ of N
            if ($L(N_x) \subseteq \eta$)
                $\tilde{Q}$ = $\tilde{Q}$ + val($N_x$)
            else if ($L(N_x) \cap \eta \neq \emptyset$)
                contribution($N_x$)
            endif
        endfor
    endif
endfunction

The first test checks if N is a leaf node, and in such a case the function, before 
halting, computes the contribution of N to Q by linear interpolation. In case N 
is not a leaf node, it is tested if all nodes descending from N x (denoting a child 
of N) are involved in the query. If this is the case, their contribution to the range 
query coincides with the value of N x . In case not all nodes descending from N x 
are involved in the query, but only some of them, their contribution is obtained 
by recursively calling the function on N x . The algorithm performs, in the worst 
case, two descents from the root to two leaves. Thus, asymptotic computational 
cost of answering a range query is $O(\log w)$, where w is the window size. Note
that the exact cost is upper bounded by n, where $n = \lceil \log \frac{w}{d} \rceil + 1$ is the number
of levels of the c-tree and d is the size of leaf nodes.

In case $t_1 < t_2 \le e$, the range query involves only data in the buffer B and
$\tilde{Q}(t_1, t_2) = (t_2 - t_1) \cdot \frac{s}{e}$ (recall that s represents the sum of data buffered in B).
Finally, in case $t_1 < e < t_2$, we have that $\tilde{Q}(t_1, t_2) = \tilde{Q}(t_1, e) + \tilde{Q}(e, t_2)$, which can
be computed by exploiting the two above cases.
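The three cases can be put together in a Python sketch. It is not the authors' implementation and rests on several assumptions: node values are exact (no bit-saving encoding), the tree is a heap-ordered array `val` with the root at index 1, and Q(t1, t2) is read as the sum of the items at recency offsets t1+1, ..., t2, where offset 1 is the most recent item (this matches the example in which t1 = 0 and t2 = 5 select the 5 most recent data). It also departs from the pseudocode in one respect: a node value is taken as a whole only when every item below it is queried, rather than when all its leaves belong to η, so that partially covered boundary leaves are always interpolated. The function and parameter names are hypothetical.

```python
def range_sum(val, num_leaves, d, P, e, s, t1, t2):
    """Estimate the sum of the items at recency offsets t1+1 .. t2 (1 = most recent)."""
    Pz = P - 1  # 0-based position of the leaf that will be overwritten next

    def involved(first, last, lo, hi):
        """How many queried items lie under the leaves first..last (0-based)."""
        runs = []
        if first < Pz:                       # leaves written after the last wrap-around
            a, b = first, min(last, Pz - 1)
            runs.append((Pz - b, Pz - a))    # recency ranks, 1 = most recent leaf
        if last >= Pz:                       # older leaves, not yet overwritten
            a, b = max(first, Pz), last
            runs.append((num_leaves + Pz - b, num_leaves + Pz - a))
        return sum(max(0, min(hi, jb * d) - max(lo, (ja - 1) * d)) for ja, jb in runs)

    def contribution(node, first, last, lo, hi):
        count = involved(first, last, lo, hi)
        if count == 0:
            return 0.0
        if count == (last - first + 1) * d:  # fully covered: use the stored sum
            return float(val[node])
        if first == last:                    # partially covered leaf: interpolate
            return count / d * val[node]
        mid = (first + last) // 2
        return (contribution(2 * node, first, mid, lo, hi)
                + contribution(2 * node + 1, mid + 1, last, lo, hi))

    estimate = 0.0
    if e > 0 and t1 < e:                     # part of the query falls in the buffer
        estimate += (min(t2, e) - t1) * s / e
    if t2 > e:                               # part of the query falls in the tree
        estimate += contribution(1, 0, num_leaves - 1, max(t1, e) - e, t2 - e)
    return estimate


# Final state of Example 1: root 389, inner nodes 205 and 184, leaves 47, 158, 153, 31.
val = [0, 389, 205, 184, 47, 158, 153, 31]
print(range_sum(val, 4, 2, 2, 0, 0, 0, 4))   # 78.0, exact: 29 + 18 + 16 + 15
print(range_sum(val, 4, 2, 2, 0, 0, 1, 4))   # 54.5, while the exact answer is 49
```

The second query cuts the most recent leaf in half, so its contribution is interpolated as 47 / 2 = 23.5 instead of the true value 18; this is the interpolation error discussed in Section 5.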

5 Experiments 

We start this section by describing the test bed used for our experiments. 
Available Storage: For experiments conducted we have used 22 four-byte num- 
bers for all techniques. According to (1) (given in Section 3), the above constraint 
has to be taken into account when the two basic parameters n and k of the c-tree 
are set. We have chosen to fix these parameters to values n = 7 and k = 14 (we 
will motivate such a choice next in this section). 

Techniques: We compare our technique with (the motivations of such a choice 
are given in the Introduction): (1) V-Optimal (VO) [18], which produces 11
buckets; for each bucket both upper bound and value are stored; (2) Wavelet
(WA) [20], which are constructed using the bi-orthogonal 2.2 decomposition of 
the MATLAB 5.3 with 11 four-byte Wavelet coefficients plus another 11 four- 
byte numbers for storing coefficient positions. 

Synthetic Data Streams: Synthetic data streams are obtained by randomly 
generating 10000 data values belonging to the range [0, 100]. 
Real-Life Data Streams: Real-life data have been retrieved from [10] and
represent the daily maximum air temperature recorded by the station STBARBRA.A
in the County of Santa Barbara from 1994 to 2001. The data set contains 2922 values and the range
is from 10.6 to 38.3 degrees Celsius.

Query Set and Error Metric: In our experiments, we use two different query
sets for evaluating the effectiveness of the various methods: (1) $QS_1$ consists
of all range queries from 1 to q with $1 \le q \le w$ and (2) $QS_2$ is the set of all
range queries having size $round(\frac{w}{10})$, where, we recall, w is the size of the sliding
window. At each time t we measure the error E(t) produced by the techniques on the
above query sets by using the average of the relative error, $\frac{1}{|Q|} \sum_{i=1}^{|Q|} e_i^{rel}$, where $|Q|$ is
the cardinality of the query set, and $e_i^{rel}$ is the relative error, i.e., $e_i^{rel} = \frac{|S_i - \tilde{S}_i|}{S_i}$,
where $S_i$ and $\tilde{S}_i$ are, respectively, the actual answer and the estimated answer
of the i-th query of the considered query set. Then we compute the average of
the error E(t) over the entire data stream duration. After a suitable initial delay
sufficient to fill the sliding window, queries are applied at each new arrival.
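The error metric amounts to a few lines of Python. The sketch below assumes that every actual answer is nonzero, so that the relative error is defined; the function name `average_relative_error` is hypothetical.

```python
def average_relative_error(actual, estimated):
    """Mean of |S_i - S~_i| / S_i over a query set (all S_i assumed nonzero)."""
    errors = [abs(s - s_hat) / s for s, s_hat in zip(actual, estimated)]
    return sum(errors) / len(errors)


# Two queries on the window of Example 1: one answered exactly (78 vs 78),
# one estimated as 54.5 against an actual answer of 49.
E_t = average_relative_error([78, 49], [78, 54.5])
```

Averaging E(t) over all arrivals after the window has been filled then gives the figure reported for a whole data stream.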

Sliding Window Size: In our experiments, we use sliding windows of size 64, 
128, 256, 512, 1024, that are dimensions frequently used for experiments in this 
context (e.g., see [6,9,16]). 

Now we consider the problem of the choice of a suitable value for n and k, 
that are, we recall, number of levels of the c-tree and number of bits used for 
encoding the left child node of the root (for the successive levels, as already 
mentioned, we drop 1 bit per level) respectively. Observe that, according to 
the result about the error given in Section 3, setting the parameter k means 
fixing also the error due to scaling approximation. We have performed some 
experiments on synthetic data in order to test the error dependence on the 
parameters n and k for different window sizes by using the average relative error 
on query set 1. Experiments produce similar curves (not reported here for space 
limitations) showing that the error decreases as n increases and decreases as 
k increases until k = 11 and then it remains near constant. Indeed, the error 
consists of two components: (1) the error due to the interpolation inside the
leaf nodes partially involved in the query, and (2) the scaling approximation.
For k > 11, the last component is negligible, and the error keeps a quasi-constant
behavior since the first component depends only on n. Therefore, in order to
reduce the error, we should set k to a value as large as possible, allowing us
to represent leaves with a sufficient number of bits (not too much lower than
the threshold heuristically determined above). However, for a fixed compression
ratio, this may limit the depth of the tree and, thus, the resolution determined 
by the leaves. As a consequence, the error arising from linear interpolation done 
inside leaf nodes increases. In sum, the choice of k plays the role of solving the 
above trade-off. These criteria are employed in experiments in order to choose 
the value of n and k, respectively, on the basis of the storage space amount. 

Now we present results obtained by experiments. For each data set we have 
calculated the average relative error on both query set 1 and query set 2. 
In Figures 2(a) and 2(b) we have reported results obtained on real and
synthetic data sets, respectively, varying the sliding window size (we have considered
sizes 64, 128, 256, 512) and using query set 1. The c-tree shows the best accuracy,
with significant gaps especially with respect to Wavelet. Note that the c-tree, in the case of a
sliding window of size 64, does not produce any error since there is no discretization
and, furthermore, a leaf node is encoded by 9 bits, which are sufficient to
represent exactly a single data value.

In Figures 3(a) and 3(b) we have replicated the previous experiment,
considering the behavior of the techniques on query set 2. Observe that the accuracy of
the techniques becomes worse on query set 2 since the range query size is very small
(indeed, the range query involves only 10% of the sliding window data). This
comparison also shows the superiority of the c-tree over the other histogram methods. In
Figures 4(a) and 4(b) we have studied the accuracy of the c-tree versus the number of
levels, fixing k = 14, with sliding windows of size 256, 512 and 1024. In this
experiment we have used query set 1. Finally, we observe that, thanks to
experiments conducted with query set 2, we have verified that the behaviour of
the c-tree is “macroscopically” independent of the position of the range query in
the window. Macroscopically here means that even though some queries can be
privileged (for instance those involving only entire buckets), both the
average and the variability of the query answer error are not biased. This basically
reflects the equi-width nature of the c-tree histogram.





(a) : Real-life Data (b) : Synthetic Data 

Fig. 2. Error for query set 1 



6 Conclusions and Future Work 

In this paper we have presented a tree-like histogram used for reducing sliding
windows and supporting fast approximate answers to arbitrary range-sum
queries on them. Through a large set of experiments, the method is successfully
compared with the most relevant proposals among the related histogram-based
approaches. The histogram is designed for implementing data stream
pre-processing in a KDD process which exploits arbitrary hierarchical range-sum
queries. Our feeling is that the c-tree histogram, as it encodes a concise
multi-layer description of the sliding window, can be adapted for supporting further
kinds of queries, also useful in the context of data stream mining. The study of
this issue is left as future research.

(a) : Real-life Data (b) : Synthetic Data

Fig. 3. Error for query set 2

(a) : Real-life Data (b) : Synthetic Data

Fig. 4. Error for query set 1
Abstract. To represent and manage data mining patterns, several
aspects have to be taken into account: (i) patterns are heterogeneous in
nature; (ii) patterns can be extracted from raw data by using data mining 
tools (a-posteriori patterns) but also defined by the users and used for ex- 
ample to check how well they represent some input data source (a-priori 
patterns); (iii) since source data change frequently, issues concerning pat- 
tern validity and synchronization are very important; (iv) patterns have 
to be manipulated and queried according to specific languages. Several 
approaches have been proposed so far to deal with patterns, however all 
of them lack some of the previous characteristics. The aim of this paper 
is to present an overall framework to cope with all these features. 

1 Introduction 

In many different modern contexts, a huge quantity of raw data is collected. A
usual approach to analyze such data is to generate some compact knowledge
artifacts (i.e., clusters, association rules, frequent itemsets, etc.) through data
processing methods, to make them manageable by humans while preserving as
much as possible their hidden information or discovering new interesting
correlations. Those knowledge artifacts, which can be very heterogeneous and complex,
are also called patterns. Although a large variety of techniques for pattern min- 
ing exist, we still miss comprehensive environments supporting the development 
of knowledge intensive applications. Such an environment goes much beyond the 
use of pattern mining techniques; it has to provide support for combining hetero- 
geneous patterns, for characterizing their temporal behavior, and for querying 
and manipulating them. In what follows we elaborate on these requirements. 
Heterogeneity. There are many different application contexts from which var- 
ious types of patterns can be generated and need to be managed. For example, 
in the market-basket analysis, common patterns are association rules, which 
identify sets of items usually sold together, or clusters, used to realize a market 
segmentation analysis. Moreover, we may be interested not only in patterns
generated from raw data by using some data mining tools (a-posteriori patterns)
but also in patterns known by the users and used for example to check how well
some data source is represented by them (a-priori patterns).

Temporal information. Since source data change with high frequency, an im- 
portant issue consists in determining whether existing patterns, after a certain 
time, still represent the data source from which they have been generated,
possibly being able to change pattern information when the quality of the representation
changes. Two different kinds of temporal information can be considered: (i) transaction
time, i.e., the time the pattern “starts to live” in the system. For a-priori patterns,
it is the instant when the user inserts the pattern in the system; for a-posteriori 
patterns, it is the instant when the pattern is extracted from raw data and 
inserted in the system; (ii) validity period, i.e., the time interval in which the 
pattern is assumed to be reliable with respect to its data source. The validity 
period can be either assigned by the user or by the system, depending on the 
quality of raw data representation ( semantic validity) achieved by the pattern. 

Pattern languages. Patterns should be manipulated (e.g. extracted, synchro- 
nized, deleted) and queried through a Pattern Manipulation Language (PML) 
and a Pattern Query Language (PQL). PML must support the management of 
a-posteriori and a-priori patterns. PQL must support both operations against 
patterns and operations combining patterns with raw data (cross-over queries). 

Several approaches have been proposed so far to deal with patterns, however 
all of them lack some of the previous characteristics. Most of them deal with 
specific types of a-posteriori patterns, often stored together with raw data, and 
do not consider temporal information [6,8,12-15]. However, as it has been rec- 
ognized [5], due to the quite different characteristics of raw data and patterns, 
to ensure an efficient handling of both, it could be better to use two dedicated 
systems: a traditional Data Base Management System (DBMS) for raw data and 
a specific Pattern Based Management System (PBMS) for patterns. 

In this paper we propose a comprehensive framework to deal with patterns 
within a PBMS, addressing the above requirements, and we develop in detail
some key notions of the framework, such as: (i) a temporal pattern representation 
model, allowing one to associate time and validity with patterns; (ii) a temporal 
pattern manipulation language (TPML) and a temporal pattern query language 
(TPQL), supporting specialized predicates and operators to deal with temporal 
information. To the best of our knowledge this is the first proposal dealing with 
temporal aspects of pattern representation and management. 

The remainder of the paper is organized as follows. In Section 2, the basic 
architecture and the pattern model are introduced. In Sections 3 and 4, the 
TPML and TPQL are discussed, respectively. Related work is then discussed in 
Section 5. Finally, Section 6 presents some conclusions and outlines future work. 

2 The Pattern Model 

Pattern-Base Management System. A Pattern-Base Management System 
(PBMS), first introduced in the context of the PANDA project [5], is a system 
for handling patterns defined over raw data. 
Fig. 1. PBMS Architecture



The overall architecture of the system is shown in Fig. 1. The Data Man- 
agement System on the left-hand side of the figure deals with data collections, 
whereas the Pattern Management System, on the right-hand side, deals with 
patterns. The user may interact with both systems by means of dedicated
manipulation languages. Within the PBMS, we distinguish three different layers: (i)
the pattern layer, which is populated with patterns (pattern-base); (ii) the pat- 
tern type layer , which holds built-in and user-defined types for patterns; (iii) the 
class layer, which holds definitions of pattern classes, i.e. , collections of patterns. 
End-users may directly interact with the PBMS: to this end, the PBMS adopts 
ad-hoc techniques not only for representing and storing patterns, but also for 
querying patterns or recalculating them from raw data. 

Basic Model Concepts. Based on the proposed architecture, the concepts 
at the basis of the pattern model are: pattern types, patterns, and classes (for 
additional details, see [15]). A pattern type is the intensional form of patterns, 
giving a formal description of their structure and relationship with source data. 
It is a record with five elements: (i) the pattern name n; (ii) the structure schema 
s, which defines the pattern space by describing the structure of the patterns 
instances of the pattern type; (iii) the source schema d, which defines the related 
source space by describing the dataset from which patterns are constructed; 

(iv) the measure schema m, which is a tuple describing the measures which 
quantify the quality of the source data representation achieved by the pattern; 

(v) the formula f, which describes the relationship between the source space 
and the pattern space, thus representing the semantics of the pattern. Inside f,
attributes are interpreted as free variables ranging over the components of either
the source or the pattern space. Note that, though in some particular domains
f may exactly express the inter-space relationship, in most cases it will describe
it only approximately.

Given a pattern type pt, a mining function $\mu$ for pt takes as input a data
source, applies a certain computation to it, and returns a set of patterns, instances
of pt. We then call measure function the function computing the measures
n: AssociationRule
s: TUPLE(head: SET(STRING), body: SET(STRING))
d: BAG(transaction: SET(STRING))
m: TUPLE(confidence: REAL, support: REAL)
f: $\forall x\,(x \in head \lor x \in body \Rightarrow x \in transaction)$

n: ItemCluster
s: TUPLE(representative: TUPLE(id: STRING, price: REAL, qty: REAL), max_dist: REAL)
d: SET(product: TUPLE(id: STRING, price: REAL, qty: REAL))
m: TUPLE(AvgIntraClusterDist: REAL)
f: $\forall x \in ps\ (dist(representative, x) < max\_dist)$

pid: 512
s: (head = {'Boots'}, body = {'Socks', 'Hat'})
d: 'SELECT SETOF(article) AS transaction FROM sales GROUP BY transId'
m: (confidence = 0.75, support = 0.55)
f: $\forall x\,(x \in \{'Boots'\} \lor x \in \{'Socks', 'Hat'\} \Rightarrow x \in transaction)$

n: FrequentItemset
s: SET(item: STRING)
d: BAG(transaction: SET(STRING))
m: TUPLE(support: REAL)
f: $\forall x\,(x \in freq\_set \Rightarrow x \in transaction)$

Fig. 2. Examples of pattern types and patterns



of patterns over a certain dataset. We store such information in some catalog of 
the pattern layer. 

Patterns are instances of a specific pattern type containing: (i) a pattern 
identifier pid; (ii) a structure that positions the pattern within the pattern space;
(iii) a source that identifies the specific dataset the pattern relates to (footnote 1); (iv) a
measure that estimates the quality of the raw data representation achieved by the 
pattern; (v) an instantiated formula, obtained from the one in the pattern type 
by instantiating each attribute appearing in s with the corresponding value, and 
letting the attributes appearing in d range over the source space. Dot notation 
and path expressions can be used to denote pattern components. 

A class is a set of semantically related patterns of a certain pattern type and 
constitutes the key concept in defining a pattern query language. 

Example 1. Consider the following scenario. A commercial vendor traces shop 
transactions and he applies data mining techniques to determine how he can 
further increase his sales. To this purpose, the vendor deals with several kinds of 
patterns: (i) association rules, representing correlations between items sold; (ii) 
clusters of products, grouping sold products with respect to their price and sold
quantity; (iii) frequent itemsets, recording items most frequently sold together. 

As an example, consider the pattern type for association rules in Fig. 2. The 
structure schema is a tuple modeling the head and the body as sets of strings 
representing products. The source schema specifies that association rules are 
constructed from a bag of transactions, each defined as a set of products. The 
measure schema includes two common measures to assess the rule relevance: its 
confidence (what percentage of transactions including the head also include the 
body) and its support (what percentage of the whole set of transactions include 
both the head and the body). The formula represents (exactly, in this case) the
pattern/source data relationship by associating each rule with the set of transactions
which support it. Now suppose that data related to sales transactions
are stored in a relational table sales(transId, article, qty). Using an extended SQL

1 When not otherwise stated, data sources are intensionally described as queries over
the raw data.
syntax to denote the dataset, an example of an instance of AssociationRule (generated,
for instance, by using the Apriori algorithm [10]) is presented in Fig. 2.

Other examples of pattern types are presented in Fig. 2. For instance, a 
cluster of products (represented by some numeric features) can be modeled by 
defining its representative element and the allowed maximal distance between 
each element in the cluster and its representative, whereas a frequent itemset is
just characterized by the set of items it represents. □ 
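As an illustration only, the pattern model of Example 1 can be sketched in Python. The class names `PatternType` and `Pattern`, the function `association_rule_measures`, and the choice to keep schemas and formulas as plain strings are assumptions of this sketch, not part of the PBMS; the measure function simply follows the confidence and support definitions given in Example 1.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Tuple


@dataclass(frozen=True)
class PatternType:
    """Intensional description of a family of patterns: (n, s, d, m, f)."""
    name: str                        # n: pattern name
    structure_schema: str            # s: kept as descriptive text in this sketch
    source_schema: str               # d: kept as descriptive text
    measure_schema: Tuple[str, ...]  # m: names of the measures
    formula: str                     # f: kept textual, not evaluated here


@dataclass(frozen=True)
class Pattern:
    """An instance of a pattern type: (pid, s, d, m)."""
    pid: int
    ptype: PatternType
    structure: Mapping[str, Any]
    source: str                      # intensional source: a query over raw data
    measure: Mapping[str, float]


def association_rule_measures(head, body, transactions):
    """Hypothetical measure function using the definitions of Example 1."""
    total = len(transactions)
    with_head = [t for t in transactions if head <= t]
    with_both = [t for t in with_head if body <= t]
    support = len(with_both) / total if total else 0.0
    confidence = len(with_both) / len(with_head) if with_head else 0.0
    return {"confidence": confidence, "support": support}


ASSOCIATION_RULE = PatternType(
    name="AssociationRule",
    structure_schema="TUPLE(head: SET(STRING), body: SET(STRING))",
    source_schema="BAG(transaction: SET(STRING))",
    measure_schema=("confidence", "support"),
    formula="forall x (x in head or x in body => x in transaction)",
)

# The instance shown in Fig. 2.
P512 = Pattern(
    pid=512,
    ptype=ASSOCIATION_RULE,
    structure={"head": {"Boots"}, "body": {"Socks", "Hat"}},
    source="SELECT SETOF(article) AS transaction FROM sales GROUP BY transId",
    measure={"confidence": 0.75, "support": 0.55},
)
```

Keeping the source as a query string mirrors the intensional description of data sources; a real system would evaluate the query against the raw data before calling a measure function.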

Pattern Validity. We extend the model proposed in [15] to deal with semantic 
and temporal validity issues. To this end, we assume that no temporal informa- 
tion is available from raw data and we associate each pattern with a transaction 
time and a validity period. Transaction time is automatically computed by the 
PBMS and points out when a pattern has been inserted in the system. This infor- 
mation cannot be changed by the user, thus it is just recorded in system catalogs. 
On the other hand, the validity period is the interval [StartTime, EndTime) in 
which the pattern can be considered reliable with respect to raw data, and 
therefore usable. The validity period can be queried by the user, thus it must 
be inserted in the model. Moreover, we suppose that the validity period can be 
either assigned and managed by the user or by the system, depending on the 
operations performed over patterns (see Section 3). For the sake of simplicity, we
deal with a fixed time granularity tg, chosen by the PBMS administrator. Thus,
the validity period schema is always of type [StartTime: tg, EndTime: tg) where
tg is fixed. Each pattern is then extended with a new component vt, representing 
the actual pattern validity period according to the chosen granularity. 

Temporal validity specifies that the pattern is assumed to be valid in that 
period. However, since raw data change with a high frequency, the pattern, in 
its validity period, may not correctly represent raw data it is associated with. 
To this end, we also introduce the concept of semantic validity with respect to a 
data source and the notion of safety, for patterns that are both temporally and
semantically valid at a certain instant of time. Note that semantic validity (and
thus safety) can only be checked on an instant-by-instant basis, since we do not
know how the data source will change in the future or how it was in the past.

Definition 1. Let $p$ be a pattern, with $p.m = \langle m_1 : v_1, \ldots, m_n : v_n \rangle$, and $t$ an
instant of time. Suppose each measure $m_i$ is associated with a boolean operator
$\theta_i$ such that $v_1\,\theta_i\,v_2$ means that $v_1$ is "better than" $v_2$. $p$ is temporally valid at
$t$ if $t \in p.vt$. $p$ is semantically valid at $t$ with respect to a data source $D$ with
thresholds $v_1, \ldots, v_n$ if and only if $D \models_t p$ and $p.m.m_i\,\theta_i\,v_i$, $i = 1, \ldots, n$. $D \models_t p$
means that $p$ can be extracted from $D$ at time $t$. $p$ is safe at $t$ with respect to a
data source $D$ with thresholds $v_1, \ldots, v_n$ if it is both temporally and semantically
valid at $t$ with respect to $D$ and $v_1, \ldots, v_n$. □

Semantic validity can be seen as a function of time. By checking semantic 
validity periodically, we may plot how measures change over time.

Example 2. Consider the pattern in Fig. 2 (say p). Suppose p.vt = [1-APR-04,
31-MAR-05), thus p is valid from 1-APR-04 to 31-MAR-05. For instance, p is
temporally valid on 22-MAY-04. However, since raw data is continuously changing,
it may be possible that, on 22-MAY-04, no transaction in the data source of p
(say D) contains both 'Hat' and 'Boots'. Thus, the support and the confidence
of p in D on 22-MAY-04 are 0. Therefore, on 22-MAY-04, p is not semantically valid
with respect to D, for any threshold values, and it is not safe. □
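The three notions of Definition 1 can be sketched as follows. This is a minimal illustration under stated assumptions: the names `ValidityPeriod`, `temporally_valid`, `semantically_valid` and `safe` are hypothetical, the "better than" operators $\theta_i$ are modelled with `operator.ge` (higher is better), and the test $D \models_t p$ is passed in as a boolean instead of being computed by a mining function.

```python
import operator
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ValidityPeriod:
    """Half-open validity period [start, end)."""
    start: date
    end: date

    def contains(self, t: date) -> bool:
        return self.start <= t < self.end


def temporally_valid(vt: ValidityPeriod, t: date) -> bool:
    """p is temporally valid at t if t belongs to p.vt."""
    return vt.contains(t)


def semantically_valid(extractable, measures, thresholds, better) -> bool:
    """p can be extracted from D at t and every measure meets its threshold."""
    if not extractable:
        return False
    return all(better[name](measures[name], thresholds[name]) for name in thresholds)


def safe(vt, t, extractable, measures, thresholds, better) -> bool:
    """Safety: temporal and semantic validity at the same instant."""
    return temporally_valid(vt, t) and semantically_valid(
        extractable, measures, thresholds, better
    )


# Example 2: valid period [1-APR-04, 31-MAR-05), checked on 22-MAY-04.
VT = ValidityPeriod(date(2004, 4, 1), date(2005, 3, 31))
BETTER = {"confidence": operator.ge, "support": operator.ge}  # assumption
THRESHOLDS = {"confidence": 0.30, "support": 0.25}
```

With these definitions, the pattern of Example 2 is temporally valid on 22-MAY-04 but not safe, because it cannot be extracted from the data source on that day.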

3 Temporal Pattern Manipulation Language (TPML) 

TPML must support primitives to generate patterns from raw data, to insert 
them in the PBMS, to delete, and to update patterns. These operations are 
defined by taking into account validity issues and differences between a-posteriori 
and a-priori patterns. 

Insertion Operations. To cope with both a-posteriori and a-priori patterns,
three different types of insertions are supported: extraction, direct insertion, and
recomputation. Insertion operators must set the validity period of the patterns
involved, which is an additional parameter of the operator. If the user does not
specify any validity period, it is set by default to $[Current\_time, +\infty)$.
Extraction $\mathcal{E}(pt, d, cond, \mu, pr)$ extracts patterns of pattern type $pt$ from a raw
dataset $d$ by applying mining function $\mu$. The validity period $pr$ is assigned to
these (a-posteriori) patterns, and they are inserted in the PBMS if they satisfy
condition $cond$, defined by using predicates that will be presented in Section 4.1.
Direct Insertion $\mathcal{I}(pt, d, s, m, pr)$ allows the user to insert in the PBMS patterns
from scratch (a-priori patterns) by taking as input a pattern type $pt$, a source
$d$, a structure $s$, a tuple of measure values $m$, and a validity period $pr$.
Recomputation $\mathcal{R}(pt, cond, d, \mu_m, pr)$ generates new patterns from old ones, by
recomputing their measures over a given raw dataset. More precisely, given the
instances of a pattern type $pt$ satisfying a given condition $cond$, the measures of
those patterns over the raw dataset $d$ are computed, according to some input
measure function $\mu_m$. New patterns are created and inserted into the system.
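The extraction operator described above can be sketched in a few lines. Everything here is an assumption made for illustration: the function names `extract` and `toy_frequent_items`, the dictionary layout of a pattern, and the use of `date.max` to stand in for $+\infty$. The toy mining function only counts single items and is not the Apriori algorithm.

```python
from datetime import date

FOREVER = date.max  # stands in for +infinity in this sketch


def toy_frequent_items(transactions):
    """Hypothetical mining function: one single-item pattern per distinct item."""
    total = len(transactions)
    items = sorted(set().union(*transactions))
    return [
        {
            "s": frozenset({item}),
            "m": {"support": sum(item in t for t in transactions) / total},
        }
        for item in items
    ]


def extract(dataset, cond, mining_fn, period=None, today=None):
    """Mine patterns, keep those satisfying cond, and stamp a validity period.

    When no period is given, the default [Current_time, +infinity) is used.
    """
    today = today if today is not None else date.today()
    start, end = period if period is not None else (today, FOREVER)
    return [dict(p, vt=(start, end)) for p in mining_fn(dataset) if cond(p)]


# Usage in the spirit of Fig. 3, line 5 (toy data, hypothetical names).
june_sales = [{"a", "b"}, {"a"}, {"c"}, {"a", "c"}]
fs = extract(
    june_sales,
    lambda p: p["m"]["support"] > 0.30,
    toy_frequent_items,
    period=(date(2004, 6, 1), date(2004, 6, 30)),
)
```

Passing the mining function as a parameter reflects the fact that the operator is parametric in $\mu$; direct insertion and recomputation could be sketched in the same way by replacing the mining step.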

Example 3. Consider Example 1. At the end of every month, the vendor mines
his transaction data to extract association rules and frequently sold itemsets. A
validity period is assigned to each extracted pattern, from the first day till the last
day of the month. In order to generate such patterns, the extraction operator can
be used (Fig. 3, lines 2-3 and 5) against the relational view JuneSales, storing
information concerning sales in June, with mining functions $\mu_{Apriori}$ and $\mu_{FreqSeq}$.
The extracted patterns are then inserted in classes MinedAssociationRules and
UsedFrequentItemsets, respectively (Fig. 3, lines 4 and 6) (see below for class-based
operators). Furthermore, to specialize his advertising campaign, he mines
clusters of products, based on numerical information concerning price and sold
quantity of each product. Since the vendor does not know the expiration time of
those clusters, he assumes they are always valid. To implement this behavior, the
extraction operator is used against the relational view Products, storing information
concerning sold products (Fig. 3, line 7), by using mining function $\mu_{SLink}$.
Such patterns are then inserted in class SoldItemClusters (Fig. 3, line 8). □
1: /* Pattern generation */
2: $AR = \mathcal{E}(AssociationRule, JuneSales, (confidence > 0.30, support > 0.25),$
3: $\mu_{Apriori}, [\text{1-JUN-04}, \text{30-JUN-04}))$
4: FORALL $i \in AR$ DO $\mathcal{I}_C(i, MinedAssociationRules)$
5: $FS = \mathcal{E}(FrequentItemset, JuneSales, support > 0.30, \mu_{FreqSeq}, [\text{1-JUN-04}, \text{30-JUN-04}))$
6: FORALL $j \in FS$ DO $\mathcal{I}_C(j, UsedFrequentItemsets)$
7: $CS = \mathcal{E}(ItemCluster, Products, AvgIntraClusterDist < 0.20, \mu_{SLink})$
8: FORALL $k \in CS$ DO $\mathcal{I}_C(k, SoldItemClusters)$
9: /* Pattern deletion */
10: $S(AssociationRule, support < 0.50)$
11: /* A-priori pattern management */
12: $v = \mathcal{I}(FrequentItemset, DS, \{A, B\}, \langle 0 \rangle)$
13: $\mathcal{I}_C(v, UsedFrequentItemsets)$
14: $\mathcal{S}(v, true, \mu_{FreqSeq})$
15: /* Restoring temporal validity */
16: $\mathcal{D}_C(Current\_time \text{ after } vt, c)$
17: /* Promotion of a new product */
18: $W = \pi_{(\langle representative \rangle, \langle\rangle)}(\sigma_{\ldots}(SoldItemClusters))$
19: $W \bowtie_{\ldots.representative\, \in\, \ldots.head,\ cf} (MinedAssociationRules)$
20: $\mathcal{S}(AssociationRule, true, \mu_{Apriori})$
21: $\mathcal{S}(FrequentItemset, true, \mu_{FreqSeq})$
22: $\mathcal{S}(ItemCluster, true, \mu_{SLink})$

Fig. 3. Pattern manipulation session

Deletion Operator. $S(pt, cond)$ removes the instances of pattern type $pt$ satisfying
condition $cond$ from the pattern layer if they belong to no class.

Example 4. Consider Example 1. Suppose the vendor is no longer interested in
association rules with support lower than 0.50. Thus he performs a deletion
(Fig. 3, line 10). Note that the association rules extracted at lines 2-3 are not deleted
since they have already been inserted in a class. □



Update Operators. We assume that only measures and the validity period 
can be changed, by using the following operators. 

New-Period $\mathcal{N}_P(pt, cond, pr)$ updates a set of patterns, instances of a pattern
type $pt$ and satisfying a condition $cond$, by setting the validity period to $pr$.
Note that this operator does not recompute the measure values of the patterns.
Synchronize $\mathcal{S}(pt, cond, \mu_m)$ makes patterns safe without changing the validity
period. More precisely, it recomputes the measure values associated with temporally
valid patterns, instances of a pattern type $pt$ and satisfying a condition
$cond$, to reflect data source modifications, by using the input measure function
$\mu_m$. Only temporally valid patterns are synchronized since all the others, by
definition of safety, cannot become safe.

Validate $\mathcal{V}(pt, cond, \mu_m)$ makes patterns safe by changing their validity period.
More precisely, it first recomputes the measure values associated with temporally
valid patterns, instances of a pattern type $pt$ and satisfying a condition $cond$.
If such measure values are better than the ones associated with the patterns before
validation, the patterns are semantically valid. Thus, similarly to synchronization,
measures are modified, and the validity period is left unchanged. On the other
hand, if the measures are worse than before, the validity period of the pattern is
changed, setting the end time to $Current\_time$, and a new pattern is created, with
the same structure and dataset as the previous one, but with the new measures
and validity period $[Current\_time, +\infty)$. Since, after validation, patterns are
semantically valid at the starting and ending points of their validity period, it is
possible to use the validity period as an approximation of the periods in which
a pattern is semantically valid.

Example 5. Consider Example 1. After a certain period of time, the vendor
receives information about sale transactions of a new shop (suppose those data
are stored in a dataset DS). Suppose he wants to trace how often a product A
(e.g. milk) and another product B (e.g. cookies) are sold together in this shop.
To this purpose, he first inserts the pattern representing such an itemset in the
system, with measure value equal to 0, by using a direct insertion operation
(Fig. 3, lines 12-13); he then synchronizes the pattern with raw data to get the
right frequency (Fig. 3, line 14). □
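The difference between Synchronize and Validate can be made concrete with a small sketch. The names `Pattern`, `support_over`, `synchronize` and `validate` are hypothetical, patterns carry a single measure (support), "better" is assumed to mean "higher", and the case of an unchanged measure, which the description above leaves open, is treated here like an improvement.

```python
from dataclasses import dataclass, replace
from datetime import date

FOREVER = date.max  # stands in for +infinity


@dataclass(frozen=True)
class Pattern:
    structure: frozenset
    support: float
    start: date          # validity period is [start, end)
    end: date


def temporally_valid(p, t):
    return p.start <= t < p.end


def support_over(transactions):
    """Hypothetical measure function: fraction of transactions containing the itemset."""
    def measure(itemset):
        if not transactions:
            return 0.0
        return sum(itemset <= t for t in transactions) / len(transactions)
    return measure


def synchronize(patterns, measure_fn, now):
    """Recompute measures of temporally valid patterns; periods are untouched."""
    return [
        replace(p, support=measure_fn(p.structure)) if temporally_valid(p, now) else p
        for p in patterns
    ]


def validate(patterns, measure_fn, now):
    """Recompute measures; if they got worse, close the old period and open a new pattern."""
    result = []
    for p in patterns:
        if not temporally_valid(p, now):
            result.append(p)
            continue
        new_support = measure_fn(p.structure)
        if new_support >= p.support:  # assumption: equal counts as "not worse"
            result.append(replace(p, support=new_support))
        else:
            result.append(replace(p, end=now))
            result.append(replace(p, support=new_support, start=now, end=FOREVER))
    return result
```

In the setting of Example 5, a pattern inserted with support 0 would be passed to `synchronize` with a measure function built from the dataset DS, after which its support reflects the actual frequency.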



Operators for Classes. According to our model, a pattern must be inserted
in at least one class in order to be queried. Thus, two TPML operations are
provided: insertion, $\mathcal{I}_C(p, c)$, of a pattern $p$ into class $c$, and deletion, $\mathcal{D}_C(cond, c)$,
of all the patterns in class $c$ satisfying condition $cond$. Note that the deletion
operator just removes patterns from a class but leaves them in the system.

Example 6. Consider Example 1. In general, the user may be interested in restor- 
ing the temporal validity of a certain class c, i.e. he may want to delete from 
c all patterns that are not temporally valid at the current time. Such behavior 
can be achieved by using the $\mathcal{D}_C$ operator, with a temporal predicate in its
condition (Fig. 3 line 16). □ 
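The interplay between the pattern layer and the class layer (Examples 4 and 6) can be sketched as follows. The class `PatternBase` and its method names are assumptions of this sketch, and patterns are plain dictionaries; the point is only that deleting from a class leaves the pattern in the system, while deleting from the pattern layer skips patterns that belong to some class.

```python
class PatternBase:
    """Toy pattern layer plus class layer (hypothetical structure)."""

    def __init__(self):
        self.patterns = {}  # pid -> pattern (a dict in this sketch)
        self.classes = {}   # class name -> set of pids

    def insert(self, pid, pattern):
        self.patterns[pid] = pattern

    def class_insert(self, pid, class_name):
        """Insert an existing pattern into a class."""
        if pid not in self.patterns:
            raise KeyError(pid)
        self.classes.setdefault(class_name, set()).add(pid)

    def class_delete(self, cond, class_name):
        """Remove matching patterns from the class only; they stay in the system."""
        members = self.classes.get(class_name, set())
        members -= {pid for pid in members if cond(self.patterns[pid])}

    def delete(self, cond):
        """Remove matching patterns from the pattern layer if they belong to no class."""
        in_some_class = set().union(*self.classes.values())
        doomed = [
            pid for pid, p in self.patterns.items()
            if cond(p) and pid not in in_some_class
        ]
        for pid in doomed:
            del self.patterns[pid]
```

A temporal clean-up in the style of Example 6 would call `class_delete` with a condition such as `lambda p: not (p["start"] <= now < p["end"])`, where the keys `start` and `end` are again hypothetical.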



4 Temporal Pattern Query Language (TPQL) 

TPQL supports the retrieval of patterns from the PBMS, taking temporal issues 
into account. Each operator of TPQL takes classes as input and returns a set of 
patterns as output. Moreover, cross-over operators, binding patterns with raw 
data, are provided. In the following, before presenting the TPQL operators, some 
useful predicates are identified. 

4.1 Pattern Predicates 

Predicates over Pattern Components. Let $p_1$ and $p_2$ be two patterns. The
general forms of a predicate over pattern components are $t_1\,\theta\,t_2$ and $t_1\,\theta\,o$, where
$t_1$ and $t_2$ are path expressions that denote components of patterns $p_1$ and $p_2$, of
compatible type, $o$ is a constant suitable for the type of $t_1$, and $\theta$ is an operator
suitable for the type of $t_1$, $t_2$, and $o$. We consider the following special cases:

- If $t_1$ and $t_2$ are data sources, then $\theta \in \{=^i, \subseteq^i, =^e, \subseteq^e\}$. Constants $o$ in this
case are queries characterizing a dataset. $=^i$ stands for equivalence and $\subseteq^i$
for containment between intensional data source descriptions (i.e., between
queries). These predicates do not require accessing raw data and can be
checked by using results obtained in the literature for queries. On the other
hand, $=^e$ and $\subseteq^e$ are checked by accessing raw data (thus, they are cross-over
predicates). More precisely, $t_1 =^e t_2$ if and only if $\forall x\,(x \in t_1 \Leftrightarrow x \in t_2)$ and
$t_1 \subseteq^e t_2$ if and only if $\forall x\,(x \in t_1 \Rightarrow x \in t_2)$.

- If $t_1$ and $t_2$ are pattern formulas, then $\theta \in \{\equiv, \preceq\}$. $t_1 \equiv t_2$ is true if and only
if $t_1$ and $t_2$ are equivalent formulas; $t_1 \preceq t_2$ is true if and only if $t_1$ logically
implies $t_2$. Given a tuple $o$, containing one value for each free variable in $t_1$,
$t_1(o)$ is true if and only if $t_1$ instantiated with the values in $o$ is true.

- If $t_1$ and $t_2$ are validity periods, then $\theta \in \{$equals, before, meets, overlaps,
during, starts, finishes$\}$. The meaning of such predicates is defined in [16].
$o$ in this case is a temporal value, according to the chosen granularity.
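The seven validity-period predicates can be sketched over half-open intervals represented as `(start, end)` pairs. The definitions below follow the usual interval-algebra reading of these names; this is an assumption of the sketch, since the authoritative definitions are the ones given in [16]. The helper `select` is likewise hypothetical.

```python
from datetime import date


def equals(a, b):
    return a[0] == b[0] and a[1] == b[1]


def before(a, b):
    return a[1] < b[0]


def meets(a, b):
    return a[1] == b[0]


def overlaps(a, b):
    return a[0] < b[0] < a[1] < b[1]


def during(a, b):
    return b[0] < a[0] and a[1] < b[1]


def starts(a, b):
    return a[0] == b[0] and a[1] < b[1]


def finishes(a, b):
    return b[0] < a[0] and a[1] == b[1]


def select(patterns, predicate):
    """Toy selection: keep the patterns satisfying a predicate."""
    return [p for p in patterns if predicate(p)]


# Patterns reduced to a pid and a validity period vt (hypothetical layout).
YEAR_2003 = (date(2003, 1, 1), date(2003, 12, 31))
PATTERNS = [
    {"pid": 1, "vt": (date(2003, 3, 1), date(2003, 6, 1))},
    {"pid": 2, "vt": (date(2003, 1, 1), date(2003, 6, 1))},
    {"pid": 3, "vt": (date(2002, 6, 1), date(2003, 1, 1))},
]
IN_2003 = select(PATTERNS, lambda p: during(p["vt"], YEAR_2003))
```

Under this strict reading, a period that shares an endpoint with the reference interval satisfies starts or finishes but not during, so a query meant to capture "valid at some point within the year" would combine several of these predicates.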



Predicates over Patterns. In the following, $p_1$ and $p_2$ are patterns.

- Identity ($=$). $p_1 = p_2$ if $p_1.pid = p_2.pid$.

- Shallow equality ($=_s$). $p_1 =_s p_2$ if their corresponding components, except
for $pid$ and the validity period $vt$, are equal. For the data source, we consider
intensional equality.

- Intensional subsumption ($\preceq$). $p_1 \preceq p_2$ if they have the same structure but
$p_1$ represents a smaller set of raw data, i.e. $p_1.s = p_2.s$, $p_1.d \subseteq^i p_2.d$ and
$p_1.f \preceq p_2.f$.

- Extensional subsumption ($\preceq_e$). $p_1 \preceq_e p_2$ if they have the same structure but
$p_1$ represents a smaller set of raw data through the considered formula, i.e.
$p_1.s = p_2.s$ and $p_1.d|_{p_1.f} \subseteq p_2.d|_{p_2.f}$, where $d|_f$ represents the set of source
data items satisfying the formula.

- Goodness ($\preceq_g$). $p_1 \preceq_g p_2$ if they have the same pattern type, $p_1 \preceq p_2$,
and $p_1$ measures are better than $p_2$ measures, i.e., assuming that $pt.m =
(m_1, \ldots, m_n)$, $p_1.m.m_i\,\theta_i\,p_2.m.m_i$, $i = 1, \ldots, n$ (footnote 2).

- Temporal validity ($\omega_T$). Given a pattern $p_1$ and a temporal value $t$, $\omega_T(p_1, t)$
is true if and only if $p_1$ is temporally valid at time $t$.

- Semantic validity ($\omega_S$). Given a pattern $p$ of type $pt$, a data source $D$, a
measure function $\mu_m$ for $pt$, and some thresholds $v_1, \ldots, v_n$,
$\omega_S(p, D, \mu_m, \langle v_1, \ldots, v_n \rangle)$ is true if and only if $p$ is semantically valid with respect to $D$
and $v_1, \ldots, v_n$, assuming to compute measure values by using $\mu_m$.

Note that $\preceq_e$, $\preceq_g$, and $\omega_S$ are cross-over predicates.



4.2 Query Operators 

Basic Operators. In the PBMS framework, queries are executed against 
classes. Besides typical relational operators (such as renaming and set-based
operators), several other query operators are proposed (see Table 1). For example,
projection is revisited to project out structure and measure components.
The selection operator allows one to select patterns belonging to a certain class 

2 According to Def. 1, $\theta_i$ is a predicate expressing that $p_1.m.m_i$ is "better than" $p_2.m.m_i$.
Table 1. TPQL basic operators

| Name | Operator | Description |
|---|---|---|
| Projection | $\pi_{(l_s, l_m)}(c)$, where $c$ is a pattern class, $l_s$ is a non-empty list of attributes appearing in the pattern structure, and $l_m$ a list of attributes appearing in the pattern measure | It reduces the structure and the measures of the patterns in $c$ by projecting out components not appearing in $l_s$ and $l_m$ |
| Selection | $\sigma_F(c)$, where $c$ is a class and $F$ is a selection predicate | It selects the patterns in $c$ satisfying $F$ |
| Join | $c_1 \bowtie_{F, cf} c_2$, where $c_1$ and $c_2$ are two classes, $F$ is a join predicate, and $cf$ a composition function | It combines patterns belonging to $c_1$ and $c_2$, if they satisfy the join predicate $F$; each new pattern is generated by using the composition function $cf$ |



satisfying a certain condition, using any predicate introduced in Section 4.1. 
When using cross-over predicates, it becomes a cross-over operator. Finally, the 
join operator combines patterns belonging to two different classes, with possibly 
different pattern types. It requires the specification of a join predicate and a 
composition function, which defines the pattern type of the result. 

Temporal Operators. Since we deal with temporal information associated
with patterns, the need arises to query such information. By using the proposed
query operators (especially selection and join) and the predicates defined
over validity periods (see Section 4.1), several interesting temporal queries can be
specified. For instance, the user may be interested in retrieving from a certain
class $c$, at a fixed instant of time (e.g. 'now'), all safe patterns. To this purpose,
selection can be used as follows: $\sigma_{\omega_S(p,\,p.d,\,\mu_m,\,v)\,\wedge\,\omega_T(p,\,'now')}(c)$ (footnote 3). As another
example, retrieval of the patterns belonging to a certain class $c$, which are temporally
valid in a given interval of time (e.g. a certain year), can be specified as
follows: $\sigma_{vt\ \mathrm{during}\ [\text{01-JAN-03},\ \text{31-DEC-03})}(c)$.

Cross-over Operators. They correlate patterns with raw data, providing a
way for navigating from the pattern layer to the raw data layer and vice versa.
Drill-Through $\gamma$. It allows one to retrieve the subset of source data associated
with at least one pattern in a class $c$, satisfying condition $cond$:

$\gamma(c, cond) = \{x \mid \exists p \in c,\ cond(p) = true,\ x \in p.d\}$.

Data Covering $\theta_d$. Let $p$ be a pattern of type $pt$, $D$ a data source, $\mu$ a mining
function for patterns of type $pt$, and $v = \langle v_1, \ldots, v_n \rangle$ some user-specified
thresholds. Data covering allows us to determine the subset of source data represented
by at least one pattern in the class. To this purpose, the formula is used:
$\theta_d(c, D, \mu) = \{x \mid x \in D,\ \exists p \in c,\ p.f(x) = true\}$.

Cross-over selection. When using a cross-over predicate within a selection, we
need to access raw data to execute the query. For example, suppose that $c$ is a
class of association rules and $D$ a dataset suitable for patterns in $c$. The query
$\sigma_{D\,\subseteq^e\,d\ \wedge\ support > 0.6}(c)$ returns all rules in $c$ representing a superset of $D$, with a
support greater than 0.6.
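The drill-through and data covering operators can be sketched directly from their set-builder definitions. Several simplifications are assumed: each pattern's source `d` is materialized as a set of transactions (in the model it is an intensional query), the formula `f` is a Python callable, and the names `rule_formula`, `drill_through` and `data_covering` are hypothetical.

```python
def rule_formula(head, body):
    """Formula of an association rule: every item of head and body is in the transaction."""
    items = frozenset(head) | frozenset(body)
    return lambda transaction: items <= transaction


def drill_through(cls, cond):
    """Source data associated with at least one pattern of the class satisfying cond."""
    result = set()
    for p in cls:
        if cond(p):
            result |= set(p["d"])
    return result


def data_covering(cls, dataset):
    """Items of the dataset represented (through f) by at least one pattern of the class."""
    return {x for x in dataset if any(p["f"](x) for p in cls)}


# Toy data: transactions are frozensets so that they can be stored in sets.
T1 = frozenset({"Boots", "Socks", "Hat"})
T2 = frozenset({"Boots"})
T3 = frozenset({"Scarf", "Gloves"})

RULE = {
    "d": {T1, T2, T3},
    "m": {"support": 1 / 3},
    "f": rule_formula({"Boots"}, {"Socks", "Hat"}),
}
OTHER = {
    "d": {T3},
    "m": {"support": 0.9},
    "f": rule_formula({"Scarf"}, {"Gloves"}),
}
```

Both functions touch raw data, which is what makes them cross-over operators; with intensional sources, `drill_through` would have to evaluate each pattern's query before taking the union.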

3 p denotes a generic pattern in c. 
Example 7. Consider Example 1. Suppose that from April 2005, the vendor will
start to sell a certain product P and he wants to know how P can be promoted.
To do that, he looks for a correlation between P and some other items sold.
With such information, he may activate an advertising campaign to promote
some other product he already sells in order to stimulate the demand for P.
In this way, when he starts to sell P, customers will probably start to buy it
without the need for a dedicated advertising campaign. A possible approach
could be the following. First of all, he determines to which cluster of products
P belongs and gets the representative R, by using a selection and a projection
operator (Fig. 3 line 18). Then, he determines which products stimulate the sale
of R by considering bodies of association rules having R in their head. This result
can be achieved by performing a join operation (Fig. 3 line 19) between the patterns
just retrieved and the association rules already mined. We assume that the
composition function used returns patterns representing the bodies of the selected
association rules. Products in association rule bodies are such that whenever a
customer buys one of them, with a high probability he also buys R. Since P is
in the cluster represented by R, P and R are similar with respect to customer
preferences, thus it is most likely that when the vendor starts to sell P, customers
will behave as for R. When, on April 1, 2005, the vendor starts to sell P, new data
are collected and patterns previously extracted may become unreliable. Thus, a
synchronization is required between data and patterns (Fig. 3 lines 20-22). □



5 Related Work 

Several approaches have been proposed to model patterns. Among standard- 
ization efforts for modeling patterns, we recall the Predictive Model Markup 
Language (PMML) [4], the ISO SQL/MM standard [3], and the Common Ware- 
house Model (CWM) framework [2]. Although these approaches represent a wide 
range of data mining results, they do not provide a generic model to handle ar- 
bitrary pattern types. Furthermore, their main purpose is to enable an easy 
interchange of metadata, not their effective manipulation.

In inductive databases, data and patterns are stored and queried together [1,
7, 8, 12]. They rely on specific (but extensible) types of patterns and are primarily
focused on a-posteriori patterns. Moreover, validity is not considered. Within
this framework, the entire knowledge discovery process is a querying process
[12-14]. However, new SQL-based operators do not allow the user to specify 
specific mining functions [9]. Moreover, none of the proposed languages deals 
with pattern validity and synchronization aspects. 

In [11], the authors propose a unified algebraic framework for multi-step 
knowledge discovery. Similarly to our approach, they model different types of 
patterns and maintain data and patterns separated. However, temporal infor- 
mation is not considered. 

Previous work closely related to the work presented here has been reported
in two previous papers by us and other authors [6, 15], where a model for patterns
and a pattern query language have been proposed. The major difference between
that work and the one presented here is the extension with temporal features. We
believe that this is a relevant extension from both a theoretical and an architectural
point of view.

6 Concluding Remarks 

In this paper, we presented a general framework for pattern representation and
management, taking into account validity information, a-priori and a-posteriori 
patterns. The resulting framework seems general enough to cope with real data 
mining applications. We are currently working on the development of a prototype 
of the proposed framework. Future work includes the definition of a pattern 
calculus, equivalent to the proposed algebra, the analysis of their expressive 
power and complexity, and the comparison of the expressive power of existing 
approaches to deal with patterns. We also plan to further investigate semantic 
validity to extend temporal analysis capabilities for patterns. 
Abstract. In this paper we propose a novel spatial associative classifier method
based on a multi-relational approach that takes spatial relations into account. 
Classification is driven by spatial association rules discovered at multiple 
granularity levels. Classification is probabilistic and is based on an extension of 
naive Bayes classifiers to multi-relational data. The method is implemented in a 
Data Mining system tightly integrated with an object relational spatial database. 
It performs the classification at different granularity levels and takes advantage
of domain-specific knowledge in the form of rules that support qualitative spatial
reasoning. An application to real-world spatial data is reported. Results show
that the use of different levels of granularity is beneficial. 



1 Introduction 

The rapidly expanding amount of spatial data gathered by collection tools, such as
satellite systems or remote sensing systems, has paved the way for advances in spatial
data structures [12], spatial reasoning [8] and computational geometry [23] to
serve multiple tasks including storage and sophisticated treatment of real-world
geometry in a spatial database. A spatial database contains (spatial) objects that are
characterized by a geometrical representation (e.g. point, line, and region in a 2D
context) as well as several non-spatial attributes. The widespread use of spatial
databases in real-world applications (e.g. geo-marketing or environmental analysis) is
leading to an increasing interest in Spatial Data Mining, i.e. in mining interesting and 
useful but implicit knowledge. Classification of spatial objects is a fundamental task 
in Spatial Data Mining, where training data consists of multiple target spatial objects 
(primary data), possibly spatially-related with other non-target spatial objects (secon- 
dary data). The goal is to learn the concept associated with each class on the basis of 
the interaction of two or more spatially-referenced objects or space-dependent 
attributes, according to a particular spacing or set of arrangements [15]. 

While a lot of research has been conducted, both in the propositional and the
multi-relational setting, on mining classification models from data possibly stored in
multiple tables of a relational database, only a few works deal with classification models
to be discovered in spatial databases. Indeed, mining spatial classification models
presents two main sources of complexity, that is, the implicit definition of spatial
relations and the granularity of the spatial objects. The former is due to the fact that
the geometrical representation (e.g. point, line, and region in a 2D context) and the
relative positioning of spatial objects with respect to some reference system implicitly
define spatial relations of different nature, such as directional and topological.
Modeling these spatial relations is a key challenge in classification problems that arise
in spatial domains [24]. Indeed, both the attribute values of the object to be classified
and the attribute values of spatially related objects may be relevant for assigning an 
object to a class from a given set of classes. The second source of complexity refers to 
the fact that spatial objects can be described at multiple levels of granularity. For 
instance, UK census data can be geo-referenced with respect to the hierarchy of areal 
objects: 

ED → Ward → District → County,

based on the inside relationship between locations. Therefore, some kind of 
taxonomic knowledge of task-relevant geographic layers may also be taken into 
account to obtain descriptions at different granularity levels ( multiple-level 
classification). 

In this paper we propose a novel spatial classification method based on a multi- 
relational approach that takes spatial relations into account. Classification is probabil- 
istic and is based on the extension of naive Bayes classifiers to multi-relational data. 
Classification rules are automatically generated by means of a spatial association rule 
discovery system characterized by the capability of generating association rules at 
multiple levels of granularity. In this way, the proposed method can deal with both 
sources of complexity presented above. The proposed method has been implemented 
in a Data Mining system tightly integrated with an object-relational spatial database. 
It can perform the classification at different levels of granularity and takes advantage
of domain specific knowledge expressed in the form of rules to support qualitative
spatial reasoning. Finally, it handles categorical as well as numerical data through a 
contextual discretization method. 

The paper is organized as follows. In the next section we discuss the background of 
this research and some related works. The mining of multi-level spatial association 
rules for classification purpose is presented in Section 3 while the multi-relational 
Naive Bayes classification is described in Section 4. Section 5 describes the system 
architecture. Finally, an application is presented in Section 6 and some conclusions 
are drawn. 



2 Background and Motivations 

The problem of classifying spatial objects has been investigated by some researchers. 
Ester et al. [10] proposed a neighbourhood graph based extension of decision trees 
that considers both non-spatial attributes of the classified objects and relations with 
neighbouring objects. However, the proposed method takes into account neither hierarchical
relations defined on spatial objects nor non-spatial attributes (e.g.
number of residents) of neighbouring objects. In contrast, Koperski [15] described an
efficient method that classifies spatial objects by considering both spatial and hierarchical
relations between spatial objects and takes into account non-spatial attributes
of neighbouring objects. However, this method suffers from severe limitations due to
the restrictive representation formalism known as the single-table assumption [26]. More
specifically, it is assumed that data to be mined are represented in a single table of a 
relational database, such that each row (or tuple) represents an independent unit of the 
sample population and columns correspond to properties of units. This requires that 
non-spatial properties of neighbouring objects be represented in aggregated form, causing
a consequent loss of information and a change in the units of analysis.

In [20], the authors proposed to exploit the expressive power of predicate logic to 
represent both spatial relations and background knowledge, such as spatial hierarchies.
In addition, the logical notions of generality order and of downward refinement
operator on the space of patterns may be profitably used to define both the search
space and the search strategy. For this purpose, the ILP system ATRE [21] has been
integrated in the data mining server of a prototypical Geographical Information System
(GIS), named INGENS, which allows, among other things, the mining of classification
rules for geographical objects stored in an object-oriented database. Training is based
on a set of examples and counterexamples of geographic concepts of interest to the 
user (e.g., ravine or steep slopes). The first-order logic representation of the training 
examples is automatically extracted from maps, although it is still controlled by the 
user who can select a suitable level of abstraction and/or aggregation of data by 
means of a data mining query language [19]. 

Similarly, the discovery of spatial association rules, that is spatial and a-spatial re- 
lationships among spatial objects, has been investigated both in propositional and 
multi-relational setting. A spatial association rule is a rule of the form “$P \rightarrow Q$ $(s, c)$”
such that both $P$ (body) and $Q$ (head) are sets of literals, some of which refer to spatial
properties, and $P \cap Q = \emptyset$. $P \cup Q$ is named pattern. The support $s$ estimates the
probability $p(P \cup Q)$, while the confidence $c$ estimates the probability $p(Q|P)$.
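
As an illustration of how support and confidence are estimated, the following minimal sketch counts, over a set of reference objects, how many satisfy the body and how many satisfy the whole pattern. It is a propositional simplification and not the SPADA implementation: each object is assumed to be described by a flat set of ground literals, and the function name `support_confidence` and the toy data are hypothetical.

```python
def support_confidence(objects, body, head):
    """Estimate (support, confidence) of the rule body -> head.

    objects: list of sets, each holding the literals true for one reference object
             (a propositional simplification assumed for this sketch).
    body, head: disjoint sets of literals.
    """
    body, head = set(body), set(head)
    if body & head:
        raise ValueError("body and head must be disjoint")
    # Objects satisfying the body, and objects satisfying the whole pattern.
    n_body = sum(1 for lits in objects if body <= lits)
    n_both = sum(1 for lits in objects if (body | head) <= lits)
    support = n_both / len(objects)                      # estimate of p(P u Q)
    confidence = n_both / n_body if n_body else 0.0      # estimate of p(Q | P)
    return support, confidence


# Hypothetical toy data: four wards described by flat literals.
wards = [
    {"crossed_by_urbanarea", "townsend_high", "mortality_high"},
    {"crossed_by_urbanarea", "townsend_high"},
    {"crossed_by_urbanarea", "mortality_high"},
    {"mortality_high"},
]
s, c = support_confidence(wards, {"crossed_by_urbanarea"}, {"mortality_high"})
```

In this toy example three wards satisfy the body and two of them also satisfy the head, so the support is 2/4 and the confidence is 2/3.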

Koperski and Han [14] implemented the module Geo-associator of the spatial data 
mining system GeoMiner that mines rules from data represented in a single relation 
(table) of a relational database. In contrast, in [16], the authors proposed an ILP ap- 
proach to spatial association rules discovery. The algorithm SPADA (Spatial Pattern 
Discovery Algorithm), reported in their work, allows the extraction of multi-level 
spatial association rules, that is, association rules involving spatial objects at different 
granularity levels. SPADA has been implemented as a module of the system ARES 
(Association Rules Extractor from Spatial data) [2], which also supports users in the 
complex processes of extracting spatial objects from the spatial database, specifying 
the background knowledge on the application domain and defining a search bias. 

Despite the fact that spatial association rule mining is a descriptive task, while 
classification of spatial objects is a predictive task, recent studies in Data Mining and 
Machine Learning have investigated the opportunity of combining association rules 
discovery and classification, by taking advantage of employing association rules for 
classification purpose [6, 3]. This approach is named associative classification [17] 
and several advantages are reported in the literature for this approach. First, unlike
most classifiers, such as decision trees, association rules consider the simultaneous
correspondence of values of different attributes, hence making it possible to achieve
better accuracy [3]. Second, it makes association rule mining techniques applicable to
classification tasks. Third, the user can decide to mine both association rules and a 
classification model in the same data mining process [17]. Fourth, the associative 
classification approach helps to solve understandability problems [4, 25] that may 
occur with some classification methods. Indeed, many rules produced by standard 
classification systems are difficult to understand because these systems often use only 
domain independent biases and heuristics, which may not fulfil user’s expectation. 
With the associative classification approach, the problem of finding understandable 
rules is reduced to a post-processing task [17]; filtering based on user-defined rule
templates may help in extracting understandable rules.

Although associative classification methods present several interesting aspects, 
they also suffer from some limitations. First, most of the methods reported in the literature
work under the single-table assumption, which is a strong limitation in those
application domains characterized by a spatial dimension. Second, they have a categorical
output which conveys no information on the potential uncertainty in classification.
Small changes in the attribute values of an object being classified may result in
sudden and inappropriate changes to the assigned class. Missing or imprecise information
may prevent a new object from being classified at all. To overcome
these deficiencies, we propose to use a probabilistic classifier that returns, in
addition to the result of the classification, the confidence of the classification. This is
an important aspect because of the increasing attention on the ROC curve analysis 
[11] that defines an evaluation measure to take into account the confidence of the 
classification. Third, reported methods require additional heuristics to identify the 
most effective rule at classifying a new object. Alternatively, in the proposed ap- 
proach, the evaluation of the class is based on the computation of probabilities taking 
into account all the rules. 



3 Multi-level Spatial Association Rules 

In [2] the problem of mining spatial association rules has been formalized as follows: 

Given a spatial database (SDB), a set $S$ of reference objects tagged with a class label
$C_i \in \{C_1, C_2, \ldots, C_L\}$, some sets $R_k$, $1 \le k \le m$, of task-relevant objects, a background
knowledge BK including some spatial hierarchies $H_k$ on objects in $R_k$, $M$ granularity
levels in the descriptions (1 is the highest while $M$ is the lowest), a set of granularity
assignments $\psi_k$ which associate each object in $H_k$ with a granularity level, a couple of
thresholds $minsup[l]$ and $minconf[l]$ for each granularity level, a language bias LB that
constrains the search space;

Find strong multi-level spatial association rules, that is, association rules involving 
spatial objects at different granularity levels. 

The reference objects are the main subject of the description, that is, the 
observation units, while the task relevant objects are spatial objects that are relevant 
for the task in hand and are spatially related to the former. The sets R k typically 
correspond to layers of the spatial database, while hierarchies H k define is-a (i.e., 
taxonomical) relations of spatial objects in the same layer (e.g. river is-a water body). 
Objects of each hierarchy are mapped to one or more of the M user-defined 
description granularity levels in order to deal uniformly with several hierarchies at 
once. Both frequency of patterns and strength of rules depend on the granularity level 
$l$ at which patterns/rules describe data. Therefore, a pattern $P$ ($s\%$) at level $l$ is
frequent if $s \ge minsup[l]$ and all ancestors of $P$ with respect to $H_k$ are frequent at their
corresponding levels. An association rule $Q \rightarrow R$ ($s\%$, $c\%$) at level $l$ is strong if the
pattern $Q \cup R$ ($s\%$) is frequent and $c \ge minconf[l]$.
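
The level-wise definitions of frequent pattern and strong rule can be sketched as follows. This is only an illustration under simplifying assumptions: each pattern is assumed to have at most one ancestor at the level above (the general case may involve several hierarchies), supports are assumed to be already computed, and all names and numeric values are hypothetical.

```python
def is_frequent(pattern, level, support, minsup, ancestor_of):
    """Return True if `pattern` is frequent at `level`.

    support: dict mapping (pattern, level) -> support in [0, 1].
    minsup: dict mapping level -> minimum support threshold.
    ancestor_of: dict mapping (pattern, level) -> (ancestor pattern, ancestor level);
                 patterns at level 1 have no entry.
    """
    while True:
        # The pattern itself must reach the threshold of its own level.
        if support[(pattern, level)] < minsup[level]:
            return False
        parent = ancestor_of.get((pattern, level))
        if parent is None:
            return True  # reached the top level: every ancestor was frequent
        pattern, level = parent


def is_strong(pattern_is_frequent, confidence, minconf_l):
    """A rule is strong if its pattern is frequent and its confidence is high enough."""
    return pattern_is_frequent and confidence >= minconf_l


# Hypothetical supports for a level-2 pattern and its level-1 ancestor.
support = {("urban_area", 1): 0.55, ("large_urban_area", 2): 0.41}
ancestor_of = {("large_urban_area", 2): ("urban_area", 1)}

frequent_loose = is_frequent("large_urban_area", 2, support, {1: 0.5, 2: 0.4}, ancestor_of)
frequent_strict = is_frequent("large_urban_area", 2, support, {1: 0.6, 2: 0.4}, ancestor_of)
```

With the stricter level-1 threshold the level-2 pattern is rejected even though its own support is sufficient, because its ancestor is no longer frequent.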

The problem above is solved by the algorithm SPADA [16] that operates in three 
steps for each granularity level: i) pattern generation; ii) pattern evaluation; iii) rule 
generation and evaluation. SPADA takes advantage of statistics computed at granularity
level $l$ when computing the supports of patterns at granularity level $l+1$.

In the system ARES (http://www.di.uniba.it/~malerba/software/ARES/index.htm) 
SPADA has been loosely coupled with a spatial database, since data stored in the 
SDB Oracle Spatial are pre-processed and then represented in a deductive database 
(DDB). For instance, spatial intersection between two objects X and Y is represented 
by the extensional predicate crosses(X,Y). In this way, the expressive power of first- 
order logic in databases is exploited to specify both the background knowledge BK, 
such as spatial hierarchies and domain specific knowledge, and the language bias LB. 
Spatial hierarchies make it possible to address one of the main issues of spatial data mining,
that is, the representation and management of spatial objects at different levels of
granularity, while the domain specific knowledge stored as a set of rules in the intensional
part of the DDB supports qualitative spatial reasoning. On the other hand, the
LB allows the user to specify his/her bias for interesting solutions, and
then to exploit this bias to improve both the efficiency of the mining process and the
quality of the discovered rules. In SPADA, the language bias is expressed as a set of
constraint specifications for either patterns or association rules. Pattern constraints
specify a literal or a set of literals that should occur one or more times in
discovered patterns. During the rule generation phase, patterns that do not satisfy a
pattern constraint are filtered out. Similarly, rule constraints are used to specify literals
that should occur in the head or body of discovered rules.

In a more recent release of SPADA (3.1) a new rule constraint has been introduced
in order to specify the maximum number of literals that should occur in the head of a
rule. In this way users may define the head structure of a rule, requiring the presence
of exactly one specific literal and nothing more. When this literal describes the
class label, multi-level spatial association rules discovered by ARES may be used for
classification.
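
The effect of such a head constraint can be illustrated with a small post-hoc filter that keeps only the rules whose head consists of exactly one literal of the class predicate. This is a hedged sketch: SPADA applies its constraints during rule generation rather than afterwards, and the `Rule` type, the string encoding of literals and the helper names are assumptions made for this example.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    head: tuple  # literals in the head, e.g. ("mortality_rate(A, high)",)
    body: tuple  # literals in the body


def predicate(literal):
    """Return the predicate symbol of a literal written as 'name(args)'."""
    return literal.split("(", 1)[0].strip()


def classification_rules(rules, class_predicate):
    """Keep the rules whose head is exactly one literal of the class predicate."""
    return [
        r for r in rules
        if len(r.head) == 1 and predicate(r.head[0]) == class_predicate
    ]


# Hypothetical mined rules.
rules = [
    Rule(("mortality_rate(A, high)",), ("crossed_by_urbanarea(A, B)",)),
    Rule(("mortality_rate(A, high)", "is_a(B, urban_area)"), ("crossed_by_urbanarea(A, B)",)),
    Rule(("cars_per_person(A, high)",), ("crossed_by_urbanarea(A, B)",)),
]
usable = classification_rules(rules, "mortality_rate")
```

Only the first rule survives the filter: the second has two literals in the head and the third does not predict the class label.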



4 Naive Bayes Classification 



Once a set of rules has been extracted for each level, it is used in the construction of a
naive Bayesian classifier [5], which aims to classify any target object $o \in S$ by maximizing
the posterior probability $P(C_i|o)$ that $o$ is of class $C_i$, that is:

$$class(o) = \arg\max_i P(C_i|o)$$

By applying the Bayes theorem, $P(C_i|o)$ can be reformulated as follows:

$$P(C_i|o) = \frac{P(C_i)\,P(o|C_i)}{P(o)} \qquad (1)$$



The term $P(o|C_i)$ is estimated by means of the naive Bayes assumption:

$$P(o|C_i) = P(o_1, o_2, \ldots, o_m|C_i) = P(o_1|C_i) \times P(o_2|C_i) \times \ldots \times P(o_m|C_i)$$

where $o_1, o_2, \ldots, o_m$ represent the set of the properties, different from the class, used to
describe the object. This assumption is clearly false if the predictor variables are sta- 
tistically dependent. However, even in this case, the naive Bayesian classifier can give 
good results [5]. 
In (1) the value $P(C_i)$ is the prior probability of the class $C_i$. Since $P(o)$ is independent
of the class $C_i$, it does not affect $class(o)$, that is,

$$class(o) = \arg\max_i P(C_i)\,P(o|C_i) \qquad (2)$$

However, this formulation of the problem holds in the single-table assumption data 
representation formalism, where an object represents an independent unit of the sam- 
ple population described by means of a set of properties. In the multi-relational setting 
[7], the target object is related to other non-target objects. In order to take into account 
the relations of the target object, a modification of the problem formulation is necessary.
For this purpose, a key role is played by the extracted association rules. In particular,
the idea is to consider the set of rules to guide the computation of $P(o|C_i)$.

Given the object $o \in S$, we consider the subset of the extracted rules that can be used
to classify $o$. More formally, we consider the subset $R$ of rules whose body is satisfied
by the object to be classified both in terms of the values of properties of involved
spatial objects and in terms of the spatial relations between objects. For example, if $S$
is the set of wards in a district, a ward $w$ satisfies the rule:

mortality_rate(A, low) ← wards_relatedTo_waters(A, B),
waters_typewater(B, river), cars_per_person(A, high)

when w is spatially related (intersects) to a river and is characterized by a high aver- 
age number of cars per person. 

We use $R$ to estimate $P(o|C_i)$. In particular, we estimate $P(o|C_i)$ by means of the
probabilities associated to both spatial relations (e.g. wards_relatedTo_waters(A, B))
and properties (e.g. waters_typewater(B, river), cars_per_person(A, high)) associated
to each rule in $R$.

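For instance, if $R = \{R_1, R_2\}$, where $R_1$ and $R_2$ are two association rules of class $C_i$
extracted by SPADA:

$$R_1: \beta_{1,0} \;\text{:-}\; \beta_{1,1}, \beta_{1,2} \qquad R_2: \beta_{2,0} \;\text{:-}\; \beta_{2,1}, \beta_{2,2}$$

where $\beta_{1,1}$ and $\beta_{2,1}$ are spatial relations, $\beta_{1,2}$ and $\beta_{2,2}$ are properties and $\beta_{1,0} = \beta_{2,0}$ (class),
then

$$P(\{R_1, R_2\}|C_i) = P(\beta_{1,0} \cap \beta_{1,1} \cap \beta_{1,2} \cap \beta_{2,1} \cap \beta_{2,2}|C_i) =$$

$$P(\beta_{1,0} \cap \beta_{1,1} \cap \beta_{2,1}|C_i) \cdot P(\beta_{1,2} \cap \beta_{2,2}|\beta_{1,0} \cap \beta_{1,1} \cap \beta_{2,1} \cap C_i)$$

The first term takes into account the relations of the rules while the second term refers
to the conditional probability of satisfying the property predicates in the rules given
the relations. By means of the naive Bayes assumption, the probabilities can be factorized
as follows:

$$P(\beta_{1,0} \cap \beta_{1,1} \cap \beta_{2,1}|C_i) = P(\beta_{1,0}|C_i) \cdot P(\beta_{1,1}|C_i) \cdot P(\beta_{2,1}|C_i)$$

$$P(\beta_{1,2} \cap \beta_{2,2}|\beta_{1,0} \cap \beta_{1,1} \cap \beta_{2,1} \cap C_i) = P(\beta_{1,2}|\beta_{1,0} \cap \beta_{1,1} \cap \beta_{2,1} \cap C_i) \cdot P(\beta_{2,2}|\beta_{1,0} \cap \beta_{1,1} \cap \beta_{2,1} \cap C_i)$$

Since $\beta_{1,2}$ and $\beta_{2,2}$ do not depend on $\beta_{2,1}$ and $\beta_{1,1}$ respectively, then:

$$P(\beta_{1,2} \cap \beta_{2,2}|\beta_{1,0} \cap \beta_{1,1} \cap \beta_{2,1} \cap C_i) = P(\beta_{1,2}|\beta_{1,1} \cap C_i) \cdot P(\beta_{2,2}|\beta_{2,1} \cap C_i)$$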

By generalizing to a set of rules we have:

$$P(C_i)\,P(o|C_i) = P(C_i) \prod_{k=1}^{|R|} \Big( P(relations_k|C_i) \prod_{j} P(property_{k,j}|relations_k, C_i) \Big) \qquad (3)$$

where the term $relations_k$ represents the event that the set of spatial relations expressed
in the $k$-th rule is satisfied, while the term $property_{k,j}$ represents the event that
the $j$-th property of the $k$-th rule is satisfied.

If $relations_k = \{relation(Set_1, Set_2) \mid Set_1, Set_2 \in \{S\} \cup \{R_k, 1 \le k \le m\}, Set_1 \ne Set_2\}$ is a
set of binary relations between spatial objects (either task relevant or reference) involved
in the $k$-th rule, the probability $P(relations_k|C_i)$ is computed by means of the
naive Bayes assumption:

$$P(relations_k|C_i) = \prod_{l=1}^{|relations_k|} P(relation(Set_{l_1}, Set_{l_2})|C_i)$$

where:

$$P(relation(Set_{l_1}, Set_{l_2})|C_i) = P(relation(Set'_{l_1}, Set'_{l_2})) = \frac{|relation(Set'_{l_1}, Set'_{l_2})|}{|Set'_{l_1}| \cdot |Set'_{l_2}|} \qquad (4)$$

where $Set'_{l_j}$ is the subset of objects in $Set_{l_j}$ that are related, by means of spatial relations,
with objects in $S$ of class $C_i$, while $|relation(Set'_{l_1}, Set'_{l_2})|$ is the number of relations
between objects of $Set'_{l_1}$ and objects of $Set'_{l_2}$.
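
The ratio above can be illustrated with a short sketch. It rests on one possible reading of the definition, stated here as an assumption: the first argument of the relation ranges over the reference objects, the primed sets contain only the objects that take part in at least one relation instance with a reference object of class $C_i$, and the function name and data are hypothetical.

```python
def relation_probability(pairs, class_objects):
    """Estimate P(relation(Set1, Set2) | C_i) as a ratio of counts.

    pairs: set of (a, b) tuples, one per relation instance between a reference
           object a and a task-relevant object b.
    class_objects: reference objects labelled with class C_i.
    Assumption of this sketch: the primed sets hold only objects that occur in
    at least one relation instance involving a reference object of class C_i.
    """
    linked = {(a, b) for a, b in pairs if a in class_objects}
    if not linked:
        return 0.0
    set1 = {a for a, _ in linked}  # reference objects of class C_i that are related
    set2 = {b for _, b in linked}  # task-relevant objects related to them
    return len(linked) / (len(set1) * len(set2))


# Hypothetical relation instances between wards and water objects.
pairs = {("w1", "r1"), ("w1", "r2"), ("w2", "r1"), ("w3", "r3")}
p = relation_probability(pairs, {"w1", "w2"})
```

Here three relation instances link two wards of the class to two water objects, so the estimate is 3 / (2 · 2).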

To compute the probability $P(property_{k,j}|relations_k, C_i)$ in (3), we use the Laplace
estimation:

$$P(property_{k,j}|relations_k, C_i) = \frac{|relations_k \wedge property_{k,j} \wedge C_i| + 1}{|relations_k \wedge C_i| + F} \qquad (5)$$

where $F$ is the number of admissible values of the property. Laplace's estimate
is used in order to avoid null probabilities in equation (2). In practice, the value
of the numerator is the number of target objects of class $C_i$ that are related to other
spatial objects by means of spatial relations expressed in $relations_k$ and for which
$property_{k,j}$ is satisfied, plus one. The value of the denominator is the number of target objects of
class $C_i$ that are related to other spatial objects by means of spatial relations expressed
in $relations_k$, plus $F$.
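
A small numerical sketch shows why the Laplace estimate avoids null probabilities. The function name and the counts are hypothetical; only the ratio itself comes from the text.

```python
def laplace_estimate(n_relations_property_class, n_relations_class, n_values):
    """Laplace estimate of P(property | relations, class).

    n_relations_property_class: target objects of the class that satisfy the
        relations and the property.
    n_relations_class: target objects of the class that satisfy the relations.
    n_values: number F of admissible values of the property.
    """
    if n_values < 1 or not 0 <= n_relations_property_class <= n_relations_class:
        raise ValueError("inconsistent counts")
    return (n_relations_property_class + 1) / (n_relations_class + n_values)


# Hypothetical counts for a property with two admissible values (low, high).
p_unseen = laplace_estimate(0, 10, 2)  # property value never observed
p_common = laplace_estimate(7, 10, 2)  # property value observed 7 times out of 10
```

Even when the property value was never observed with the given relations and class, the estimate stays strictly positive (1/12 here), so a single unseen combination does not zero out the whole product.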

In order to avoid the problem that the same relation or the same property is consid- 
ered more than once in the computation of probabilities in formula (3), the values 
computed in formula (4) and (5) are effectively determined and included in formula 
(3) only if the values have not been computed before. 
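
The product in formula (3), together with the rule that each relation or property term is included only once, can be sketched as follows. This is an illustrative reading and not the system's code: probabilities are assumed to be precomputed and strictly positive, the product is accumulated in log space to avoid underflow (a choice of this sketch), and all names and values are hypothetical.

```python
import math


def class_score(prior, rules, p_relation, p_property):
    """Return log(P(C_i) * P(o | C_i)) for one class, counting each term once.

    rules: list of (relations, properties) pairs, both tuples of literals, for
           the rules whose body the object satisfies.
    p_relation: dict relation literal -> P(relation | C_i), assumed > 0.
    p_property: dict (property literal, relations) -> P(property | relations, C_i).
    """
    score = math.log(prior)
    seen = set()
    for relations, properties in rules:
        for rel in relations:
            if ("rel", rel) not in seen:  # skip relation terms already included
                seen.add(("rel", rel))
                score += math.log(p_relation[rel])
        for prop in properties:
            key = ("prop", prop, relations)
            if key not in seen:  # skip property terms already included
                seen.add(key)
                score += math.log(p_property[(prop, relations)])
    return score


def classify(priors, rules_by_class, p_relation, p_property):
    """Return the class with the highest score and the per-class log scores."""
    scores = {
        c: class_score(priors[c], rules_by_class[c], p_relation[c], p_property[c])
        for c in priors
    }
    return max(scores, key=scores.get), scores


# Hypothetical probabilities for two classes and one shared relation.
rel = ("crossed_by_urbanarea(A, B)",)
priors = {"high": 0.6, "low": 0.4}
rules_by_class = {
    "high": [(rel, ("townsend(A, high)",)),
             (rel, ("townsend(A, high)", "jarman(A, high)"))],
    "low": [(rel, ("townsend(A, high)",))],
}
p_relation = {"high": {rel[0]: 0.5}, "low": {rel[0]: 0.25}}
p_property = {
    "high": {("townsend(A, high)", rel): 0.8, ("jarman(A, high)", rel): 0.5},
    "low": {("townsend(A, high)", rel): 0.2},
}
label, scores = classify(priors, rules_by_class, p_relation, p_property)
```

For class high the shared relation and the Townsend property appear in both rules but contribute once each, giving 0.6 · 0.5 · 0.8 · 0.5 = 0.12, against 0.4 · 0.25 · 0.2 = 0.02 for class low.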



5 A Spatial Associative Classification Framework 

The integration of multi-level spatial association rules discovery with naive Bayesian 
classification is realized in a spatial associative classification system based on a 
client-server model (see Fig. 1). Both the spatial association rule miner SPADA and 
the multi-relational naive Bayes classifier are on the server side, so that several data 
mining tasks can be run concurrently by multiple users. SPADA fully exploits the 
flexibility of ILP to specify the background knowledge BK (i.e. hierarchies and domain
specific knowledge) as well as the language bias LB (i.e. search constraints).
Hierarchies are expressed by a collection of ground atoms and represent spatial objects
at different granularity levels, while domain specific knowledge is expressed as
sets of definite clauses and supports qualitative spatial reasoning. Conversely, search
constraints are used to bias the search in order to fulfil user expectations. In this
framework, constraints are also used to partially fix the structure of extracted rules in
order to discover spatial association rules that contain only the class label in the head.
For each granularity level, extracted rules contribute to building the spatial classification
model by exploiting a multi-relational naive Bayesian classifier integrated with the
SDB.

Fig. 1. Spatial associative classification system (client side and DM server side).

On the client side, the framework includes a Graphical User Interface (GUI), which 
provides users with facilities for controlling all parameters of the mining process. 

SPADA, like many other association rule mining algorithms, cannot process numerical
data properly, so it is necessary to perform a discretization of numerical features
with a relatively large domain. For this purpose, the framework includes on the
client side the module RUDE (relative unsupervised discretization algorithm), which
discretizes a numerical attribute of a relational database in the context defined by
other attributes [18].

The SDB (Oracle Spatial) can run on a third computation unit. Many spatial fea- 
tures (relations and attributes) can be extracted from spatial objects stored in the SDB. 
Feature extraction requires complex data transformation processes to make spatial 
relations explicit and representable as ground Prolog atoms. Therefore, a middle layer
module, named FEATEX (Feature Extractor), is required to enable a loose
coupling between SPADA and the SDB by generating features of spatial objects
(points, lines, or regions). The module is implemented as an Oracle package of procedures
and functions, each of which computes a different feature [2]. Transformed data
are also stored in SDB tables.



6 The Application: Mining North West England Census Data 

In this section we present a real-world application concerning the mining of both 
spatial association rules and classification models for geo-referenced census data 
interpretation. We consider both census and digital map data provided in the context 
of the European project SPIN! (Spatial Mining for Data of Public Interest) [22]. They 
concern Greater Manchester, one of the five counties of North West England (NWE). 
Greater Manchester is divided into ten metropolitan districts, each of which is decomposed
into censual sections or wards, for a total of two hundred and fourteen wards.
Spatial analysis is enabled by the availability of vectorized boundaries of the 1998 
census wards as well as by other Ordnance Survey digital maps of NWE, where sev- 
eral interesting layers are found, namely road net, rail net, water net, urban area and 
green area (see Table 1). 
Table 1. Geographic layers.

Layer name               Object types                              Geometry
Road net                 A-road; B-road; Motorway; Primary road    Line
Rail net                 Railway                                   Line
Urban area               Large urban area; Small urban area        Line
Green area               Wood; Park                                Line
Water net                Water; River; Canal                       Line
Greater Manchester Ward  Ward                                      Region



Census data are available at ward level. They provide socio-economic statistics 
(e.g. mortality rate, that is, the percentage of deaths with respect to the number of 
inhabitants) as well as some measures describing the deprivation level. Indeed, the 
material deprivation of an area may be estimated according to information provided 
by Census combined into single index scores [1]. Over the years different indices
have been developed for different applications: the Jarman Underprivileged Area
Score was designed to measure the need for primary care, the indices developed by
Townsend and Carstairs have been used in health-related analyses, while the Department
of the Environment's Index (DoE) has been used in targeting urban regeneration
funds. Accordingly, we have considered the values of the Jarman index, Townsend index,
Carstairs index and DoE index. The higher the index value, the more deprived a ward
is. Index values as well as mortality rate are all numeric and have been
discretized by means of RUDE. More precisely, the Jarman index, Townsend index, DoE
index and mortality rate have been automatically discretized into (low, high), while
the Carstairs index has been discretized into (low, medium, high).

For this application, we have considered Greater Manchester wards as reference 
(target) objects. In particular, three different experimental settings have been analysed 
by varying the target property among mortality rate, Jarman index and DoE index. We 
have chosen Jarman and DoE indices because they are defined on the basis of differ- 
ent social factors. For each setting, we have focused our attention on investigating 
dependencies between the target property and socio-economic factors represented in 
census data as well as geographical factors represented in linked topographic maps. 
These dependencies are detected in the form of spatial association rules having only the
target property in the head. Rules in this form may be employed for spatial subgroup
mining, that is, discovery of interesting groups of spatial objects with respect to a
certain property of interest [13], as well as for classification purposes.

For this analysis, we have formulated queries involving the FEATEX relate function
to compute topological relationships between reference objects and task relevant
objects. For instance, a relationship extracted by FEATEX is crosses(ward_135,
urbareaL_151), where ward_# denotes a specific Greater Manchester ward, while
urbareaL_# refers to a large urban area crossing the ward in question. The topological
relationship crosses is computed according to the 9-intersection model [9]. The number
of computed relationships is 784,107.

To support qualitative spatial reasoning, domain specific knowledge (BK) has
been expressed in the form of a set of rules. Some of these rules are:

crossed_by_urbanarea(X,Y) :- connects(X,Y), is_a(Y, urban_area). ...

crossed_by_urbanarea(X,Y) :- inside(X,Y), is_a(Y, urban_area).

Here the use of the predicate is_a hides the fact that a hierarchy has been defined
for spatial objects which belong to the urban area layer. In detail, five different 
hierarchies have been defined to describe the following layers: road net, rail net, water 
net, urban area and green area (see Fig. 2). The hierarchies have depth three and are 
straightforwardly mapped into three granularity levels. They are also part of the BK. 
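
How such background rules derive a qualitative predicate from extracted topological facts can be sketched as follows. The sketch covers only the two rules shown above (the elided ones are not reconstructed), represents the hierarchy as a dictionary from each object to its ancestors, and uses hypothetical object identifiers; the actual system evaluates these rules in a deductive database.

```python
def crossed_by_urbanarea(facts, is_a):
    """Derive crossed_by_urbanarea(X, Y) pairs from extracted topological facts.

    facts: set of (relation, x, y) ground atoms, e.g. ("connects", "ward_1", "urbareaL_1").
    is_a: dict mapping an object to the tuple of its ancestors in the hierarchy,
          so that is_a(Y, urban_area) holds when "urban_area" is among them.
    Only the two rule bodies shown in the text (connects, inside) are covered.
    """
    return {
        (x, y)
        for rel, x, y in facts
        if rel in {"connects", "inside"} and "urban_area" in is_a.get(y, ())
    }


# Hypothetical extracted facts and hierarchy membership.
facts = {
    ("connects", "ward_1", "urbareaL_1"),
    ("inside", "ward_2", "urbareaS_7"),
    ("connects", "ward_3", "river_4"),
    ("disjoint", "ward_4", "urbareaL_1"),
}
is_a = {
    "urbareaL_1": ("large_urban_area", "urban_area"),
    "urbareaS_7": ("small_urban_area", "urban_area"),
    "river_4": ("river", "water_net"),
}
derived = crossed_by_urbanarea(facts, is_a)
```

The ward connected to a river and the ward disjoint from the urban area yield no derived atom, while the other two wards do, whatever the specific kind of urban area.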



[Figure: hierarchy trees. Recoverable node labels: road_net (a_road, b_road, primary_road, motorway); water_net (water, canal, river); urban_area (large urban area, small urban area); green_area.]

Fig. 2. Spatial hierarchies defined for road net, water net, urban area and green area.



Finally, we have specified a language bias (LB) both to constrain the search space 
and to filter out uninteresting spatial association rules. In particular, we have ruled out 
all spatial relations (e.g. crosses, inside, and so on) directly extracted by FEATEX and 
asked for rules containing topological predicates defined by means of BK. Moreover, 
by combining the rule filters head_constraint([mortality_rate(_)],1,1) and
rule_head_length(1,1) we have asked for rules containing only mortality rate in the
head. Similar considerations apply to the classification tasks concerning the Jarman 
and the DoE indices. In addition, we have specified the maximum number K of re- 
finement steps (i.e. number of literals in the body of rules). 

For each setting, a ten-fold cross validation has been performed and results are 
evaluated. For instance, by analyzing spatial association rules extracted with parame- 
ters minsup = 0.1, minconf = 0.6 we discover the following rule: 

mortality_rate(A, high) ← is_a(A, ward), crossed_by_urbanarea(A, B),
is_a(B, urban_area), townsendidx_rate(A, high) (40.72%, 72.47%)

which states that a high mortality rate is observed in a ward A that includes an urban 
area B and has a high value of Townsend index. The support (40.72%) and the high 
confidence (72.47%) confirm a meaningful association between a geographical factor, 
such as living in deprived urban areas, and a social factor, such as the mortality rate. It 
is noteworthy that SPADA generates the following rule: 

mortality_rate(A, high) ← is_a(A, ward), crossed_by_urbanarea(A, B),
is_a(B, urban_area) (56.7%, 60.77%)

which has a greater support and a lower confidence. These two association rules show 
together an unexpected association between Townsend index and urban areas. Appar- 
ently, this means that this deprivation index is unsuitable for rural areas. 

At granularity level 2, SPADA specializes the task relevant object B by generating
the following rule, which preserves both support and confidence:

mortality_rate(A, high) ← is_a(A, ward), crossed_by_urbanarea(A, B),
is_a(B, urban_areaL), townsendidx_rate(A, high) (40.72%, 72.47%)

This rule clarifies that the urban area B is large. 

The average predictive accuracy of the mined multi-level spatial classification model
is evaluated by varying minsup, minconf and K for each setting. Results are reported
in Tables 2, 3 and 4. In the first setting, results show that the predictive accuracy of the
Bayesian classifier is slightly better than the accuracy (0.567) of the trivial classifier
that returns the most probable class. We explain this result with the inherent complexity
of the task. Different conclusions can be drawn from both Jarman and DoE results,
where the Bayesian classifiers significantly improve on the trivial classifiers (acc. 0.542
and 0.625, respectively). Another consideration is that the average predictive accuracies
of classification models discovered at higher granularity levels (i.e. level=2) are
always better than or equal to the corresponding accuracies at the lowest level. This means
that the classification model takes advantage of the use of the hierarchies defined on
spatial objects. Furthermore, results show that by decreasing the number of extracted
rules (higher support and confidence) we have lower accuracy. This means that there
are several rules that strongly influence classification results and often such rules are
not characterized by high values of support and confidence. Finally, we observe that,
generally, the higher the number of refinement steps, the better the model.

Table 2. Mortality Rate average accuracy.

MORTALITY Avg. Accuracy               K=4     K=5     K=6     K=7
minsup=0.1, minconf=0.6   Level=1     0.5932  0.5915  0.5932  0.628
                          Level=2     0.5932  0.596   0.5932  0.628
minsup=0.2, minconf=0.65  Level=1     0.5932  0.602   0.5932  0.623
                          Level=2     0.5932  0.602   0.5932  0.623

Table 3. Jarman average accuracy.

JARMAN Avg. Accuracy                  K=4     K=5     K=6     K=7
minsup=0.1, minconf=0.6   Level=1     0.8176  0.8176  0.8176  0.8176
                          Level=2     0.8176  0.8176  0.8176  0.8176
minsup=0.2, minconf=0.8   Level=1     0.528   0.528   0.528   0.528
                          Level=2     0.528   0.528   0.6272  0.6705

Table 4. DoE average accuracy.

DoE Avg. Accuracy                     K=4     K=5     K=6     K=7
minsup=0.1, minconf=0.6   Level=1     0.912   0.912   0.912   0.912
                          Level=2     0.912   0.912   0.912   0.912
minsup=0.2, minconf=0.8   Level=1     0.875   0.875   0.875   0.821
                          Level=2     0.875   0.9028  0.883   0.874
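
For reference, the trivial classifier used as a baseline in this comparison can be sketched in a few lines: it always predicts the most frequent class seen in the training data. The function name and the labels below are hypothetical and are not the census data.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of the classifier that always predicts the most frequent training class."""
    majority, _ = Counter(train_labels).most_common(1)[0]
    return sum(1 for y in test_labels if y == majority) / len(test_labels)


# Hypothetical labels: "high" is the majority class in the training split.
train = ["high"] * 6 + ["low"] * 4
test = ["high", "low", "high", "high"]
baseline = majority_baseline_accuracy(train, test)
```

A learned model is informative only to the extent that its accuracy exceeds this value on the same folds.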



7 Conclusions 

In this paper we have presented a spatial associative classifier that combines spatial 
association rule discovery with naive Bayes classification. Domain specific knowledge
may be defined as a set of rules that makes qualitative spatial reasoning possible.
In addition, hierarchies on spatial objects are expressed by a collection of
ground atoms and are exploited to mine classification models at different granularity 
levels. Search constraints are used to bias the spatial association rules discovery in 
order to fulfil user expectations. In particular, constraints are also used to partially fix 
the structure of extracted rules in order to discover spatial association rules that con- 
tain only the class label in the head. Finally, for each granularity level, extracted rules 
concur in building the spatial classification model by exploiting a multi-relational 
naive Bayesian classifier integrated with the SDB. 
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Experiments on real-world spatial data show that the use of different levels of 
granularity generally increases the accuracy of the mined classification model. As 
future work, we intend to frame the work within the context of hierarchical Bayesian 
classifiers, in order to exploit the multi-level nature of extracted association rules. 

Abstract. Graphs arise in numerous applications, such as the analysis
of the Web, router networks, social networks, co-citation graphs, etc.
Virtually all the popular methods for analyzing such graphs, for example,
k-means clustering, METIS graph partitioning and SVD/PCA, require
the user to specify various parameters such as the number of clusters, 
number of partitions and number of principal components. We propose 
a novel way to group nodes, using information-theoretic principles to 
choose both the number of such groups and the mapping from nodes 
to groups. Our algorithm is completely parameter-free, and also scales 
practically linearly with the problem size. Further, we propose novel al- 
gorithms which use this node group structure to get further insights into 
the data, by finding outliers and computing distances between groups. Fi- 
nally, we present experiments on multiple synthetic and real-life datasets, 
where our methods give excellent, intuitive results. 



1 Introduction and Motivation

Large, sparse graphs arise in many applications, under several guises. Conse- 
quently, because of their importance and prevalence, the problem of discovering 
structure in them has been widely studied in several domains, such as social 
networks, co-citation networks, ecological food webs, protein interaction graphs 
and many others. Such structure can be used for getting insights into the graph, 
for example, for detecting “communities”. 

Problem Description: A graph G(V, E) has a set E of edges connecting any
pair of nodes from a set V. Our definition includes both directed and undirected 
sylvania Infrastructure Technology Alliance (PITA) Grant No. 22-901-0001, and by
the Defense Advanced Research Projects Agency under Contract No. N66001-00- 
1-8936. Additional funding was provided by donations from Intel, and by a gift 
from Northrop-Grumman Corporation. Any opinions, findings, and conclusions or 
recommendations expressed in this material are those of the author(s) and do not 
necessarily reflect the views of the National Science Foundation, or other funding 
parties. 
(c) Springer-Verlag Berlin Heidelberg 2004




AutoPart: Parameter-Free Graph Partitioning and Outlier Detection 






graphs. We want algorithms that discover structure in such datasets, and provide 
insights into them. Specifically, our goals are: 

(G1) Clusters: “Similar” nodes should be grouped into “natural” clusters.
(G2) Outliers: Edges deviating from the overall structure should be tagged 
as outliers. 

(G3) Inter-cluster Distances: For any pair of clusters, a measure of the 
“distance” between them should be defined. 

In addition, the algorithms should have the following main properties: 

(P1) Automatic: We want a principled and intuitive problem formulation,
such that the user does not need to set any parameters. 

(P2) Scalable: They should scale up for large, possibly disk resident graphs. 
(P3) Incremental: They should allow online recomputation of results when 
new nodes and edges are added; this will allow the method to adapt to new 
incoming data from, say, web crawls. 

In this paper, we propose algorithms to accomplish these objectives. Intu- 
itively, we seek to group nodes so that the adjacency matrix is divided into 
rectangular/square regions as “similar” or “homogeneous” as possible. These 
regions of varying density would succinctly summarize the underlying structure 
of associations between nodes. In short, our method will take as input a matrix 
like in Figure 3(a) and produce Figure 3(g) as the output, without any human 
intervention. 

The layout of the paper is as follows. In Section 2, we survey the related 
work. Subsequently, in Section 3, we formulate our data description model start- 
ing from first principles. Based on this, in Section 3.3 we outline a two-level 
framework to find homogeneous blocks in the adjacency matrices of graphs and 
develop an efficient, parameter-free algorithm to discover them. In Section 3.5, 
we use this structure to find outlier edges in the graph, and to calculate distances 
between node groups. In Section 4, we evaluate our algorithms, demonstrating 
good results on several real and synthetic datasets. Finally, we conclude in Sec- 
tion 5. 

2 Survey 

There has been quite a bit of work on graph partitioning. The prevailing methods 
are METIS [1] and spectral partitioning [2]. Both approaches have attracted a
lot of interest and attention; however, both need the user to specify k, that is, the 
number of pieces the graph should be broken into. Moreover, they typically also 
require a measure of imbalance between the two pieces of each split. The Markov 
Clustering [3] method uses random walks, but is slow. Girvan and Newman [4] 
iteratively remove edges with the highest “stress” to eventually find disjoint 
communities, but the algorithm is again slow. Flake et al. [5] use the max- 
flow min-cut formulation to find communities around a seed node; however, the 
selection of seed nodes is not fully automatic. 

Table 1. Table of symbols.

Symbol            Definition
D                 Square binary adjacency matrix of a given graph
$d_{i,j}$         Entry in cell (i, j) of D; $d_{i,j}$ := 0 or 1
n                 Length of each side of D
k                 Number of node groups
k*                Optimal number of groups
G                 Node → group map
$G_x$             Group corresponding to node x
$D_{i,j}$         Submatrix of links from group i to j
$a_i$             Number of nodes in group i
$(a_i, a_j)$      Dimensions of $D_{i,j}$
$n(D_{i,j})$      Number of elements in $D_{i,j}$; $n(D_{i,j}) := a_i a_j$
$w(D_{i,j})$      Weight of $D_{i,j}$ = number of “1”s in $D_{i,j}$
$P_{i,j}$         Density of “1”s in $D_{i,j}$; $P_{i,j} := w(D_{i,j}) / n(D_{i,j})$
H(p)              Binary Shannon entropy function
$C(D_{i,j})$      Code cost for $D_{i,j}$
T(D; k, G)        Total cost for D



Remotely related are clustering techniques. Every row in the adjacency ma- 
trix can be envisioned as a multi-dimensional point. Several methods have been 
developed to cluster a cloud of n points in m dimensions, for example, k-means,
k-harmonic means, CURE, BIRCH, Chameleon, LSI and others [6-8]. However,
most current techniques require a user-given parameter, such as k for k-means.
One solution called X-means [9] uses BIC to determine k. However, several of 
the clustering methods suffer from the dimensionality curse (like the ones that 
require a covariance matrix); others may not scale up for large datasets. Also, 
in our case, the points and their corresponding vectors are semantically related 
(each node occurs as a point and as a component of each vector) ; most clustering 
methods do not consider this. Other related work includes information-theoretic 
co-clustering (ITCC) [10]. However, the focus there is on lossy compression, 
whereas we employ a lossless MDL-based compression scheme. No MDL-like 
principle is yet known for lossy encoding, and hence, the number of clusters in 
ITCC cannot (yet) be automatically determined. Besides these, there has been 
work on conjunctive clustering [11] and community detection [12]. 

In conclusion, the above methods miss one or more of our prerequisite properties,
typically not being automatic (P1). Next, we present our solution.

3 Proposed Method 

Our goal is to find patterns in a large graph, with no user intervention, as shown 
in Figure 3. How should we decide the number of node groups k along with the 
assignments of nodes to their “proper” groups? 

Compression as a Guide: We introduce a novel approach and propose a general,
intuitive model founded on compression, and more specifically, on the MDL
(Minimum Description Length) principle [13]. The idea is the following: the
binary n × n matrix represents associations between the n nodes of the graph
(corresponding to rows and columns in the adjacency matrix). If we mine this
information properly, we could reorder the adjacency matrix so that “similar”
nodes are grouped with each other. Then, the adjacency matrix would consist
of homogeneous rectangular/square blocks of high (low) density, representing the
fact that certain node groups have more (fewer) connections with other groups. To
compress the matrix, we would prefer to have only a few blocks, each of them
being very homogeneous. However, having more groups lets us create more
homogeneous blocks (at the extreme, having n groups gives $n^2$ perfectly homogeneous
blocks of size 1 × 1). Thus, the best compression scheme must achieve a tradeoff
between these two factors, and this tradeoff point indicates the best number of
node groups k. We accomplish this by a novel application of the overall MDL
philosophy, where the compression costs are based on the number of bits required
to transmit both the “summary” of the node groups, as well as each block given
the groups. Thus, the user does not need to set any parameters; our algorithm
chooses them so as to minimize these costs.

3.1 Compression Scheme for a Binary Matrix 

Let $D = [d_{i,j}]$ denote an n × n adjacency matrix. Each graph node corresponds
to one row and column in this matrix. We assume that n > 1. Let us index the
rows and columns as 1, 2, ..., n.

Let k denote the number of disjoint node groups. Let us index the groups as
1, 2, ..., k. Let

G : {1, 2, ..., n} → {1, 2, ..., k}

denote the assignments of nodes to groups. We can rearrange the underlying data
matrix D so that all nodes corresponding to group 1 are listed first, followed by
nodes in group 2, and so on. Such a rearrangement, implicitly, sub-divides the
matrix D into $k^2$ smaller two-dimensional rectangular/square blocks, denoted by
$D_{i,j}$, i, j = 1, ..., k. The more homogeneous these blocks, the better compression
we can get, and so, the better the choice of G. Table 1 lists the symbols used
later.

We now describe a two-part code for the matrix D. The first part is the
description complexity of the blocks formed by G, and the
second part is the actual code for the matrix given information about the
blocks. A good choice of G will compress the matrix well, which will lead to a low
total encoding cost.

Description Cost: The description complexity (i.e., information about the
rectangular/square blocks) consists of the following terms:

1. Send the number of nodes n using log*(n) bits, where
$\log^*(x) = \log_2(x) + \log_2 \log_2(x) + \ldots$ with only the positive terms
being retained [14]. This term is independent of G and k, and, hence, while
useful for actual transmission of the data, will not figure in our framework.

2. Send the node permutations using $n \lceil \log n \rceil$ bits. Again, this
term is independent of G and k.

3. Send the number of groups k in log* k bits.

4. Send the number of nodes in each node group. Let us suppose that
$a_1 \ge a_2 \ge \cdots \ge a_k \ge 1$. Compute
$\bar{a}_i := \left( \sum_{t=i}^{k} a_t \right) - k + i$ for all i = 1, ..., k − 1.
Now, the desired quantities can be sent using
$\sum_{i=1}^{k-1} \lceil \log \bar{a}_i \rceil$ bits.

5. For each block $D_{i,j}$ (i, j = 1, ..., k), send $w(D_{i,j})$, namely, the number of
“1”s in $D_{i,j}$, using $\lceil \log(a_i a_j + 1) \rceil$ bits.
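
Terms 3 to 5 above depend only on k and on the group sizes, so they can be evaluated without looking at the matrix entries. The following Python sketch illustrates this under stated assumptions: the function names `log_star` and `description_cost` are hypothetical (they do not come from the paper), every group is assumed to be non-empty as in the ordering assumption of term 4, and all logarithms are taken base 2.

```python
import math


def log_star(x):
    """log*(x) = log2(x) + log2(log2(x)) + ..., keeping only the positive terms."""
    total = 0.0
    term = math.log2(x)
    while term > 0:
        total += term
        term = math.log2(term)
    return total


def description_cost(group_sizes):
    """Bits for terms 3-5 of the description cost (assumes non-empty groups)."""
    sizes = sorted(group_sizes, reverse=True)    # a_1 >= a_2 >= ... >= a_k >= 1
    k = len(sizes)
    bits = log_star(k)                           # term 3: number of groups
    for i in range(1, k):                        # term 4: i = 1, ..., k - 1
        a_bar = sum(sizes[i - 1:]) - k + i       # a-bar_i = (sum_{t>=i} a_t) - k + i
        bits += math.ceil(math.log2(a_bar))
    for a_i in sizes:                            # term 5: weight of each block
        for a_j in sizes:
            bits += math.ceil(math.log2(a_i * a_j + 1))
    return bits
```

For example, two groups of sizes 3 and 1 give `description_cost([3, 1]) == 12`: 1 bit for k, 2 bits for the group sizes and 9 bits for the four block weights.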

Code Cost: Suppose that the entire preamble specified above (containing
information about the square and rectangular blocks) has been sent. We now transmit
the actual matrix given this information. We can calculate the density $P_{i,j}$ of
“1”s in $D_{i,j}$ using the description code above. The number of bits required to
transmit $D_{i,j}$ is

\[
\begin{aligned}
C(D_{i,j}) &= n(D_{i,j}) \, H(P_{i,j}) \\
           &= -w(D_{i,j}) \log(P_{i,j}) - \left[ n(D_{i,j}) - w(D_{i,j}) \right] \log(1 - P_{i,j})
\end{aligned}
\tag{1}
\]

where H is the binary Shannon entropy function, $n(D_{i,j}) = a_i a_j$, and all
logarithms are base 2. Summing over all the $D_{i,j}$ submatrices:

\[
\text{Code cost} = \sum_{i=1}^{k} \sum_{j=1}^{k} C(D_{i,j})
\tag{2}
\]
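
Equations 1 and 2 translate directly into code. The sketch below is a minimal illustration, assuming the adjacency matrix is given as a list of 0/1 rows and the node → group map as a list of zero-based group indices; the names `block_code_cost` and `code_cost` are hypothetical.

```python
import math


def block_code_cost(weight, size):
    """Eq. 1: C(D_ij) = n(D_ij) * H(P_ij), in bits, for `weight` ones in `size` cells."""
    if weight == 0 or weight == size:
        return 0.0                       # empty or perfectly homogeneous block
    p = weight / size                    # density P_ij
    return -weight * math.log2(p) - (size - weight) * math.log2(1 - p)


def code_cost(D, groups, k):
    """Eq. 2: sum of C(D_ij) over the k * k blocks induced by `groups`."""
    a = [groups.count(g) for g in range(k)]      # a_i: number of nodes in group i
    w = [[0] * k for _ in range(k)]              # w(D_ij): number of ones per block
    for u, row in enumerate(D):
        for v, d in enumerate(row):
            w[groups[u]][groups[v]] += d
    return sum(block_code_cost(w[i][j], a[i] * a[j])
               for i in range(k) for j in range(k))
```

On a matrix made of two 2-node cliques, putting all nodes into one group costs 16 bits (density 0.5 over 16 cells), while one group per clique costs 0 bits because every block is homogeneous.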

Total Encoding Cost: We can now write the total cost for the matrix D, with
respect to a given k and G, as:

\[
T(D; k, G) := \log^* k
  + \sum_{i=1}^{k-1} \lceil \log \bar{a}_i \rceil
  + \sum_{i=1}^{k} \sum_{j=1}^{k} \lceil \log(a_i a_j + 1) \rceil
  + \sum_{i=1}^{k} \sum_{j=1}^{k} C(D_{i,j})
\tag{3}
\]

ignoring the costs log*(n) and $n \lceil \log n \rceil$ since they are independent of G and k.
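
To make the tradeoff in Equation 3 concrete, the following sketch evaluates the total cost of a tiny matrix under three groupings. It is only an illustration under stated assumptions: the helper names (`log_star`, `block_code_cost`, `total_cost`) are hypothetical, groups are zero-indexed and assumed non-empty, and the example matrix is a toy made of two 2-node cliques.

```python
import math


def log_star(x):
    """log*(x): sum of the positive terms of log2(x), log2(log2(x)), ..."""
    total, term = 0.0, math.log2(x)
    while term > 0:
        total += term
        term = math.log2(term)
    return total


def block_code_cost(weight, size):
    """Eq. 1: bits needed for a block with `weight` ones among `size` cells."""
    if weight == 0 or weight == size:
        return 0.0
    p = weight / size
    return -weight * math.log2(p) - (size - weight) * math.log2(1 - p)


def total_cost(D, groups, k):
    """Eq. 3: description cost (terms 3-5) plus code cost, in bits."""
    a = [groups.count(g) for g in range(k)]
    w = [[0] * k for _ in range(k)]
    for u, row in enumerate(D):
        for v, d in enumerate(row):
            w[groups[u]][groups[v]] += d
    sizes = sorted(a, reverse=True)              # a_1 >= ... >= a_k >= 1 (assumed)
    bits = log_star(k)
    for i in range(1, k):
        bits += math.ceil(math.log2(sum(sizes[i - 1:]) - k + i))
    for i in range(k):
        for j in range(k):
            bits += math.ceil(math.log2(a[i] * a[j] + 1))
            bits += block_code_cost(w[i][j], a[i] * a[j])
    return bits


# Toy example: two cliques of two nodes each (diagonal included).
D = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 1]]
costs = {
    1: total_cost(D, [0, 0, 0, 0], 1),   # one group: cheap description, costly code
    2: total_cost(D, [0, 0, 1, 1], 2),   # one group per clique
    4: total_cost(D, [0, 1, 2, 3], 4),   # one group per node: costly description
}
```

Under these assumptions the three groupings cost 21, 15 and 19 bits respectively, so the grouping with k = 2 gives the shortest total description, which matches the intuition that neither too few nor too many groups compress best.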

3.2 Problem Formulation 

We want an algorithm that can optimally choose k* and G* so as to minimize
T(D; k*, G*). Typically, such problems are computationally hard, and hence, in
this paper, we shall pursue feasible practical strategies. We solve the problem
by a two-step iterative process: First, find a good node grouping G for a given
number of node groups k; and second, search for the number of node groups k.
For the former, we describe an iterative minimization algorithm that finds a G
attaining a minimum, given a fixed number of node groups k. Then, we
outline an effective heuristic strategy that searches over k to minimize the total
cost T(D; k, G).

3.3 Algorithms

In the previous section we established our goal: Among all possible values for
k, and all possible node groups G, pick an arrangement which reduces the total
compression cost as much as possible, as MDL suggests (model plus data).
Although theoretically pleasing, Equation 3 does not tell us how to go about
finding the best arrangement; it can only pinpoint the best one among several
candidates. The question is: how can we generate good candidates?

We answer this question in two steps:

1. [InnerLoop] For a given k, find a good arrangement G.

2. [OuterLoop] Efficiently search for the best k (k = 1, 2, ...).



Algorithm InnerLoop (Finding G given k):

1. Let t denote the iteration index. Initially, set t = 0. If no G(0) is provided,
start with an arbitrary G(0) mapping nodes into k node groups. For this initial
partition, compute the submatrices $D_{i,j}(t)$, and the corresponding distributions
$P_{i,j}(t)$.

2. For every node x, splice the corresponding row into k parts
$x_{\mathrm{row},1}, \ldots, x_{\mathrm{row},k}$ according to G(t) (i.e.,
$x_{\mathrm{row},1} = \{ d_{x,u} \mid G_u(t) = 1 \}$ and so on). Similarly, splice
the column into k parts $x_{\mathrm{col},1}, \ldots, x_{\mathrm{col},k}$. Compute the
number of “1”s $w(x_{\mathrm{row},j})$ and $w(x_{\mathrm{col},j})$ (j = 1 ... k) for
all these parts. Now, assign node x to node group $G_x(t + 1)$ such that

\[
\begin{aligned}
G_x(t+1) = \arg\min_{1 \le i \le k} \;
  & \sum_{j=1}^{k} -\Big[ w(x_{\mathrm{row},j}) \log P_{i,j}(t)
      + \big( n(x_{\mathrm{row},j}) - w(x_{\mathrm{row},j}) \big) \log\big(1 - P_{i,j}(t)\big) \\
  & \qquad\quad + w(x_{\mathrm{col},j}) \log P_{j,i}(t)
      + \big( n(x_{\mathrm{col},j}) - w(x_{\mathrm{col},j}) \big) \log\big(1 - P_{j,i}(t)\big) \Big] \\
  & + d_{x,x} \big[ \log P_{i,G_x(t)}(t) + \log P_{G_x(t),i}(t) - \log P_{i,i}(t) \big] \\
  & + (1 - d_{x,x}) \big[ \log(1 - P_{i,G_x(t)}(t)) + \log(1 - P_{G_x(t),i}(t)) - \log(1 - P_{i,i}(t)) \big]
\end{aligned}
\]

where the first two lines denote the cost of shifting the row and column
corresponding to node x to a new group, while the last two lines account for the
“double-counting” of the cell $d_{x,x}$ in the adjacency matrix.

3. With respect to G(t + 1), recompute the matrices $D_{i,j}(t + 1)$, and the
corresponding distributions $P_{i,j}(t + 1)$.

4. If there is no decrease in total cost, stop; otherwise, set t = t + 1, go to step 2,
and iterate.

Fig. 1. Algorithm InnerLoop.
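
The reassignment step of Fig. 1 can be sketched in Python as follows. This is a simplified illustration, not the authors' MATLAB implementation: it uses dense lists instead of sparse matrices, it omits the $d_{x,x}$ double-counting correction of step 2, and it stops when the code cost (rather than the total cost) no longer decreases. All function names are hypothetical, and groups are zero-indexed.

```python
import math


def block_code_cost(weight, size):
    """Eq. 1: bits for a block with `weight` ones among `size` cells."""
    if weight == 0 or weight == size:
        return 0.0
    p = weight / size
    return -weight * math.log2(p) - (size - weight) * math.log2(1 - p)


def block_stats(D, groups, k):
    """Group sizes a_i and block weights w(D_ij) for the current grouping."""
    a = [groups.count(g) for g in range(k)]
    w = [[0] * k for _ in range(k)]
    for u, row in enumerate(D):
        for v, d in enumerate(row):
            w[groups[u]][groups[v]] += d
    return a, w


def code_cost(D, groups, k):
    a, w = block_stats(D, groups, k)
    return sum(block_code_cost(w[i][j], a[i] * a[j])
               for i in range(k) for j in range(k))


def bits(count, prob):
    """Cost of `count` cells that each have probability `prob` under the model."""
    if count == 0:
        return 0.0
    if prob <= 0.0:
        return math.inf                  # the model says these cells cannot occur
    return -count * math.log2(prob)


def inner_loop(D, groups, k, max_iter=20):
    """Simplified InnerLoop: repeat step 2 until the code cost stops decreasing."""
    n, groups = len(D), list(groups)
    best = code_cost(D, groups, k)
    for _ in range(max_iter):
        a, w = block_stats(D, groups, k)
        P = [[w[i][j] / (a[i] * a[j]) if a[i] * a[j] else 0.0
              for j in range(k)] for i in range(k)]
        new_groups = []
        for x in range(n):
            row_w, col_w = [0] * k, [0] * k      # ones of x's row/column per group
            for u in range(n):
                row_w[groups[u]] += D[x][u]
                col_w[groups[u]] += D[u][x]

            def cost(i):                         # cost of placing node x in group i
                return sum(bits(row_w[j], P[i][j]) + bits(a[j] - row_w[j], 1 - P[i][j])
                           + bits(col_w[j], P[j][i]) + bits(a[j] - col_w[j], 1 - P[j][i])
                           for j in range(k))

            new_groups.append(min(range(k), key=cost))
        new_cost = code_cost(D, new_groups, k)
        if new_cost >= best:
            break
        groups, best = new_groups, new_cost
    return groups, best
```

For instance, on two 3-node cliques with one node initially placed in the wrong group, a single pass of this sketch moves that node back and reaches a code cost of 0.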

The InnerLoop algorithm iterates over several possible settings of G for the
same number of node groups k. Each iteration improves (or maintains) the code
cost, as stated in the theorem below.

Theorem 1. After each iteration of InnerLoop, the code cost decreases or
remains the same. The proof is omitted for lack of space.

The loop finishes when the total cost stops improving. Note that it is possible
for some groups to be empty, but this is not a problem. The complexity of
InnerLoop is O(w(D) · k · I), where I is the number of iterations.



Algorithm OuterLoop (Finding k):

1. Let T denote the search iteration index. Start with T = 0 and k(0) = 1.

2. At iteration T, try to increase k: k(T + 1) = k(T) + 1. Split the node group r
with maximum entropy per node, i.e.,

\[
r := \arg\max_{1 \le i \le k} \sum_{1 \le j \le k}
     \frac{n(D_{i,j}) H(P_{i,j}) + n(D_{j,i}) H(P_{j,i})}{a_i}
\]

Construct an initial label map G(T + 1) as follows: For every node x that belongs
to group r (i.e., for every 1 ≤ x ≤ n such that $G_x(T) = r$), place it into the
new group k(T + 1) (i.e., set $G_x(T + 1) = k(T + 1)$) if and only if it decreases
the per-node entropy of the group r, i.e., if and only if

\[
\sum_{1 \le j \le k} \frac{n(D'_{r,j}) H(P'_{r,j}) + n(D'_{j,r}) H(P'_{j,r})}{a_r - 1}
<
\sum_{1 \le j \le k} \frac{n(D_{r,j}) H(P_{r,j}) + n(D_{j,r}) H(P_{j,r})}{a_r}
\]

where $D'_{r,j}$ is $D_{r,j}$ without node x. Otherwise let $G_x(T + 1) = r = G_x(T)$. If we
move node x to the new group, we also update $D_{r,j}$ and $D_{j,r}$ (for all 1 ≤ j ≤ k)
accordingly.

3. Run the InnerLoop algorithm with initial G = G(T + 1) and k = k(T + 1) to
find a new node mapping G(T + 1) and the corresponding total cost.

4. If there is no decrease in total cost, stop and return k* = k(T) and G* = G(T).
Otherwise, set T = T + 1 and continue.

Fig. 2. Algorithm OuterLoop.



The OuterLoop algorithm tries to look for good values of k. It chooses the
node group with the maximum entropy per node, and splits it into two groups.
The nodes put into the new group are exactly the ones whose removal reduces
the entropy per node in the original group. As shown below in Theorem 2, this
split never increases the code cost.
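
The first half of step 2 in Fig. 2, choosing which group to split, can be sketched as below. This is an illustration under assumptions: the function names are hypothetical, groups are zero-indexed, the matrix is a dense list of 0/1 rows, and empty groups are given an entropy of zero so that they are never selected ahead of a non-empty group with positive entropy.

```python
import math


def block_code_cost(weight, size):
    """n(D_ij) * H(P_ij): bits for a block with `weight` ones among `size` cells."""
    if weight == 0 or weight == size:
        return 0.0
    p = weight / size
    return -weight * math.log2(p) - (size - weight) * math.log2(1 - p)


def group_to_split(D, groups, k):
    """Index r of the node group with the maximum entropy per node."""
    a = [groups.count(g) for g in range(k)]
    w = [[0] * k for _ in range(k)]
    for u, row in enumerate(D):
        for v, d in enumerate(row):
            w[groups[u]][groups[v]] += d

    def entropy_per_node(i):
        if a[i] == 0:
            return 0.0                           # assumption: skip empty groups
        total = sum(block_code_cost(w[i][j], a[i] * a[j])     # blocks in row i
                    + block_code_cost(w[j][i], a[j] * a[i])   # blocks in column i
                    for j in range(k))
        return total / a[i]

    return max(range(k), key=entropy_per_node)
```

On two 3-node cliques where one group holds two nodes of the first clique and the other group holds the remaining four nodes, the sketch selects the mixed four-node group, which is the one a split can make more homogeneous.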

Theorem 2. On splitting any node group, the code cost either decreases or re- 
mains the same. The proof is omitted due to lack of space. 

By Theorem 1, the same holds for InnerLoop. Therefore, the entire algorithm
also decreases (or maintains) the code cost (Eq. 2). However, the description complexity
evidently increases with k. We have found that, in practice, this search strategy
performs very well. The OuterLoop algorithm is run k* times, so the overall
complexity of the search is $O(w(D) (k^*)^2 I)$. In practice, I < 20 is always enough.

(a) Original (k = 1) (b) OuterLoop (k: 1 → 2) (c) InnerLoop (k = 2) (d) OuterLoop (k: 2 → 3) (e) OuterLoop (k: 3 → 4) (f) InnerLoop (k = 4) (g) OuterLoop (k: 4 → 5)

Fig. 3. Algorithm execution snapshots: Starting with a randomly permuted “caveman” 
matrix (a), the algorithm applies OuterLoop and InnerLoop till the final structure 
(g) is revealed. We omit the InnerLoop results when they produce no improvement. 
Iterations of OuterLoop are separated by vertical lines for clarity. 



Figure 3 shows an execution snapshot of the full algorithm on a randomly 
permuted “caveman” matrix (i.e., a block diagonal matrix [15]) with Zipfian
cave-sizes. OuterLoop increases the number of node groups while InnerLoop 
rearranges nodes between groups. No plots are shown when the InnerLoop does 
not decrease the total cost. The correct final result is shown in plot (g). 

3.4 Online Recomputations 

When new nodes are obtained (such as from new crawls for a Web graph), we can 
put them into the node groups which minimize the increase in total encoding cost 
due to their addition. Based on the same principle, when new edges are found, 
the corresponding nodes can be reassigned to new node groups. The algorithm 
can then be run again with this initialization till it converges. Similar methods 
apply for node/edge deletions. Thus, new additions or deletions can be handled 
without full recomputations. 

3.5 Using the Block Structure 

Having found the underlying structure of a graph in the form of node groups, 
we can utilise this information to further mine the data. Again, we use our 
information-theoretic approach to answer several tough problems efficiently, us- 
ing the node groupings found by the previous algorithms. 

Outlier Edges: Which edges between nodes are abnormal/suspicious? Intuitively, 
an outlier shows some deviation from normality, and so it should hurt attempts 
to compress data. Thus, an edge whose removal significantly reduces the total 
encoding cost is an outlier. Our algorithm is: Find the block where removal of an 
edge leads to the maximum immediate reduction in cost (that is, no iterations of 
the InnerLoop and OuterLoop algorithms are performed). All edges within that 
block contribute equally to the cost, and so all of them are considered outliers. 

\[
\text{“Outlierness” of edge } (u, v) := T(D'; k, G) - T(D; k, G)
\tag{5}
\]

where D' = D except that $d'_{u,v} = 0$. This can be used to rank the edges in terms
of their “outlierness”.
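
Because removing one edge changes neither k nor the group sizes, the description terms of Equation 3 cancel in Equation 5 and only the code cost of the block containing the edge changes. The sketch below uses this observation; it is an illustration with hypothetical function names, zero-indexed groups and a dense 0/1 matrix, and it follows Equation 5 literally, so edges whose removal reduces the cost receive negative values.

```python
import math


def block_code_cost(weight, size):
    """Eq. 1: bits for a block with `weight` ones among `size` cells."""
    if weight == 0 or weight == size:
        return 0.0
    p = weight / size
    return -weight * math.log2(p) - (size - weight) * math.log2(1 - p)


def outlierness(D, groups, k, u, v):
    """Eq. 5 for edge (u, v): T(D'; k, G) - T(D; k, G), with no regrouping."""
    if not D[u][v]:
        raise ValueError("(u, v) is not an edge of the graph")
    a = [groups.count(g) for g in range(k)]
    i, j = groups[u], groups[v]
    weight = sum(D[s][t]                         # w(D_ij) for the edge's block
                 for s in range(len(D)) if groups[s] == i
                 for t in range(len(D)) if groups[t] == j)
    size = a[i] * a[j]
    return block_code_cost(weight - 1, size) - block_code_cost(weight, size)
```

For two 3-node cliques joined by a single bridge edge, the bridge gets a negative score (removing it makes its block empty and saves bits), while an edge inside a clique gets a positive score.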

Table 2. Dataset characteristics.

Dataset       Num nodes   Num edges   Remarks
CAVE          900         170,800     Five “caves” with Zipfian sizes
CAVE-Noisy    900         190,117     10% noise added to the above
NOISE         100         1,831       Pure white noise
EPINIONS      75,888      508,960     “Who-trusts-whom” data
DBLP          6,090       175,494     Coauthorship and cocitation data



“Distance” Between Node Groups: How “close” are two node groups to each
other? Following our information theory footing, we propose the following
criterion: If two groups are “close”, then combining the two into one group should
not lead to a big increase in encoding cost. Based on this, we define “distance”
between two groups as the relative increase in encoding cost if the two were
merged into one:

\[
\mathrm{Dist}(i, j) :=
  \frac{\mathrm{Cost}(\text{merged}) - \mathrm{Cost}(i) - \mathrm{Cost}(j)}
       {\mathrm{Cost}(i) + \mathrm{Cost}(j)}
\tag{6}
\]

where only the nodes in groups i and j are used in computing costs. We
experimented with other measures (such as the absolute increase in cost) but Eq. 6
gave the best results. The cost of computing both outliers and distances
between groups is independent of the number of non-zeros w(D), and so both can
be performed for large graphs.

4 Experiments 

We did experiments to answer the following questions: (i) how good is the quality 
of the node groups found, (ii) how well do our algorithms find outlier edges, (iii) 
do our measures of “distances” between node groups make sense, and (iv) how 
well does our method scale up. All our experiments require no input parameters, 
which rules out other methods like METIS or spectral partitioning. 

We used several datasets (see Table 2), both real and synthetic. The synthetic 
ones were: (1) CAVE, representing a social network of “cavemen” [15], that is, a 
block-diagonal matrix of variable-size blocks (or “caves” ; members of a cave form 
a clique, and know only those from their own cave), (2) CAVE-Noisy, created 
by adding noise (10% of the number of non-zeros) , and (3) NOISE, with pure 
white noise. The real-world datasets are: (4) EPINIONS, a “who-trusts-whom”
social graph of www.epinions.com users [16], and (5) DBLP, a graph obtained
from www.informatik.uni-trier.de/~ley/db, with the nodes being authors 
in SIGMOD, ICDE, VLDB, PODS or ICDT (database conferences); two nodes are 
linked by an edge if the two authors have co-authored a paper or one has cited a 
paper by the other (thus, this graph is undirected). We performed experiments 
on other datasets too; the results were similar, and are not reported to save space. 
Our implementation was done in MATLAB (version 6.5 on Linux) using sparse 
matrices. The experiments were performed on an Intel Xeon 2.8GHz machine 
with 1GB RAM. 

(a) CAVE (Final) (b) CAVE-Noisy (Original) (c) CAVE-Noisy (Final) (d) NOISE (Final)



Fig. 4. Synthetic datasets: (a) Our method gives the intuitively correct groups for 
CAVE (Figure 3(a) shows the original graph). (b,c) The results remain the same in 
spite of noise in CAVE-Noisy, showing the robustness of the algorithm. (d) The NOISE
dataset shows 4 groups, which are explained by the patterns emerging due to random- 
ness, such as the “almost-empty” and “more-dense” blocks. 




Fig. 5. Real datasets: Shaded blocks are shown instead of the actual points; darker
shades correspond to denser blocks. The plots show how the algorithm has separated the
graph into large but extremely sparse, and small but very dense groups. Most well- 
known database researchers show up in the dense regions of plot (a), as expected. 



4.1 Quality 

Results - Synthetic Data: Figure 4 shows the groupings found by our method 
on several synthetic datasets. For the noise-free CAVE matrix, we get exactly the 
intuitively correct groups (plot a). When noise is present (plot b), we still get 
the correct groups (plot c), demonstrating the robustness of our algorithm. Plot 
(d) shows 4 groups for the NOISE graph. This is expected; it is well known that 
spurious patterns emerge even when we have pure noise, and our algorithm finds 
blocks of clearly lower or higher density. 

Results - Real Data: Figure 5 shows the groupings found on several real-world
datasets. For the DBLP dataset, eight groups were found. Group 8 consists only
of Michael Stonebraker, David DeWitt and Michael Carey; these are well-known
people who have a lot of papers and citations. The other groups show a
decreasing number of connections but increasing sizes, with group 1 being the
largest but having the lowest connectivity. Similarly, for the EPINIONS graph,
we find a small dense “core” group which has very high connectivity, and then
larger and less heavily-connected groupings. Thus, our method gives intuitive
results for real-world graphs too.

(a) Outlier-test graph (b) Node groups and outliers (c) Distance-test graph (d) Node groups and distances

Fig. 6. Outliers and group distances: Plot (b) shows the node groups found for graph
(a). Edges in the top-right block are correctly tagged as outliers. Plot (d) shows the
node groups and group distances for graph (c). Groups 2 and 3 (having the most
“bridges”) are tagged as the closest groups. Similarly, groups 1 and 2 are the farthest.

4.2 Outlier Edges 

To test our algorithm for picking outliers, we use a synthetic dataset as in
Figure 6(a). The node groups found are shown in Figure 6(b). Our algorithm tags
as outliers all edges whose removal would best compress the graph. Under this
principle, all edges “across” the two groups are chosen as outliers (since all edges
in a block contribute equally to the encoding cost), as shown in Figure 6(b).
Thus, the intuitively correct outliers are found.

4.3 Distances Between Node Groups 

To test for node-group distances, we use the graph in Figure 6(c), with Figure 6(d) showing the
structure found. The three caves have equal sizes but the number of “bridge” 
edges between groups varies. This is correctly picked up by our algorithm, which 
ranks groups with more “bridges” as being closer to each other. Thus, groups 2 
and 3 are tagged as the “closest” groups, while groups 1 and 2 are “farthest”. 

4.4 Scalability 

Figure 7 shows wall-clock times (in seconds) of our MATLAB implementation. 
The dataset is a “caveman” graph with 3 caves; the size of the graph and the 
number of edges in it are varied for the experiment, with the relative propor- 
tions of cave sizes being kept fixed. The execution time increases linearly with 
respect to the number of non-zeros, as expected from our order-of-complexity
computation. Thus, our proposed method can scale to large graphs. 

Time versus number of edges (nonzeros)

Fig. 7. Scalability: On a 3-cave graph, wall-clock execution time grows linearly with
the number of edges. Thus, our method can scale to large graphs.

5 Conclusions

We considered the problem of finding the underlying structure in a graph. We
introduced a novel approach and proposed a general, intuitive model founded on
lossless compression and information-theoretic principles. Based on this model,
we provided novel algorithms for finding node groups and outlier edges, as well
as for computing distances between node groups, thus fulfilling all our goals
(G1)-(G3) from Section 1. Our algorithms are fully automatic and parameter-free,
scalable and allow online computations, achieving properties (P1)-(P3).
Finally, we evaluated our method on several real and synthetic datasets, where
it produced excellent and intuitive results.

Abstract. A calibrated classifier provides reliable estimates of the true
probability that each test sample is a member of the class of interest. 
This is crucial in decision making tasks. Procedures for calibration have 
already been studied in weather forecasting, game theory, and more re- 
cently in machine learning, with the latter showing empirically that cali- 
bration of classifiers helps not only in decision making, but also improves 
classification accuracy. In this paper we extend the theoretical founda- 
tion of these empirical observations. We prove that (1) a well calibrated 
classifier provides bounds on the Bayes error, (2) calibrating a classifier
is guaranteed not to decrease classification accuracy, and (3) the proce- 
dure of calibration provides the threshold or thresholds on the decision 
rule that minimize the classification error. We also draw the parallels 
and differences between methods that use receiver operating character- 
istic (ROC) curves and calibration based procedures that are aimed at 
finding a threshold of minimum error. In particular, calibration leads to 
improved performance when multiple thresholds exist. 



1 Introduction 

In a decision making task, in order to evaluate different courses of action, it is 
useful to obtain accurate likelihood estimates of the alternatives. Pattern clas- 
sifiers can be used to provide automated mappings between situations (repre- 
sented by features) and outcomes (represented by the class membership). Yet, 
to be applicable to decision making problems, we require a reliable estimate of 
the true probability of class membership of each test sample. We will use the 
term calibrated to refer to classifiers with reliable estimates of the class member- 
ship probabilities. A successful classifier in terms of classification accuracy is not 
necessarily calibrated, e.g., the Naive Bayes classifier. Procedures for calibrat- 
ing classifiers have been proposed in different contexts: In weather prediction 
tasks [1], in game theory [2,3], and more recently in the context of pattern clas- 
sification [4,5]. Zadrozny and Elkan were also the first to notice the need of 
calibrating classifiers when used as decision making aids. 

Our own incentive to study calibration came from applying probabilistic 
based classifiers to the problem of characterizing and forecasting the I/O re- 
sponse time of large storage arrays given passive observations. As these forecasts 
are used for scheduling purposes, we also need to accompany each forecast with 
an accurate estimate of the probability of the forecast. We applied a variant 
of the calibration procedure suggested in [1,4] and noticed that in addition to
producing more accurate estimates, the classification accuracy of our induced 
classifiers increased. While these empirical results agree with those of Zadrozny 
and Elkan [4,5], a theoretical guarantee that calibration cannot degrade classi- 
fication performance was still missing. Our investigation of the calibration pro- 
duced the following results which we prove in Sections 3 and 4. First, we can 
bound the Bayes error using the same parameters that result from the calcu- 
lations needed for calibration. Second, we are guaranteed that the classification 
accuracy of the original classifier does not decrease as a consequence of calibra- 
tion. Moreover, the classification accuracy can actually increase. Third, using 
the calibration process we can compute a threshold or thresholds in the decision 
rule of the classifier that minimize the classification error. We show that when 
a single threshold is derived from the calibration procedure, the result is equiv- 
alent to finding the point of minimum error in an ROC curve [6,7]. However, 
when calibration produces multiple thresholds on the decision rule, the error 
achieved with those is lower than that of any single threshold derived from the 
ROC based methods. Thus, in addition to producing more accurate estimates 
of a-posteriori probabilities, calibration obviates the need for using ROC based 
methods for finding optimal thresholds. 

The rest of the paper is organized as follows. Section 2 introduces formally 
the notions of calibration, refinement and Brier score. Sections 3 and 4 contain 
the proofs of our main results. Section 5 illustrates the effects of calibration on 
classifiers induced on real data, observing also the effect of the sample size on 
the process of calibration. Finally, Section 6 discusses and summarizes the main 
results. 

2 Notation and Preliminary Definitions 

A classifier takes an incoming vector of features X and maps it to a class label. 
We will use C to denote the class variable, the values of which are called classes.
Throughout this paper we assume a binary classification problem, i.e., one in
which C takes one of two values. We will use (1, 0), or $(c, \bar{c})$, to denote a specific
instantiation of C. Each instantiation of X, denoted by x, is a sample. We assume
that all samples are i.i.d. 

Let p(C|X) be the true a-posteriori distribution of the class given the features. 
The optimal classification rule, that is, the optimal function that maps a sample 
x to one of the values of C, under the 0-1 cost function, is the maximum a- 
posteriori (MAP) rule [8]: 

$$g^*(x) = \arg\max_{c' \in \{0,1\}} \left[ p(C = c' \mid x) \right]. \tag{1}$$

The decision rule g*(x) is called the Bayes optimal decision and 

$$e_B = \sum_{x} p(g^*(x) \neq c \mid C = c)\, p(x) = \sum_{x} \min\left(p(C = 1 \mid x),\, p(C = 0 \mid x)\right) p(x) \tag{2}$$

is the associated probability of error. This error is known as the Bayes error (or
Bayes risk), and it is the minimum probability of error achievable with the given
set of features (see footnote 1).

Given that $p(C \mid X)$ is unknown, one strategy for classification is to induce
an estimate $\hat{p}(C \mid X)$ of the a-posteriori probability, and then use a decision rule
$g(X)$ such that the classification error, given by

$$CE = \sum_{x} p(g(x) \neq c \mid C = c)\, p(x) \tag{3}$$

is minimized. We note that plugging $\hat{p}(C \mid X)$ into the decision rule in Eq. 1 may
not be optimal [6]: given the errors and biases embedded in the estimate
$\hat{p}(C \mid X)$, the threshold of 0.5, implicit in Eq. 1, may not minimize the error in
Eq. 3. We return to this subject in Section 4, where we show the link between
calibration and the decision rule that minimizes Eq. 3.

The classification error provides one way to evaluate classifiers. However, 
when using the classifier output as a basis for decision making, we need a score 
that takes into account not only the prediction accuracy of the classifier, but 
also the quality of the estimate $\hat{p}(C \mid X)$. One such score is the Brier score [9].
The Brier score is one of a class of so-called proper scores [1] which are used in 
evaluating the subjective probability assessment of forecasters. For the binary 
classification case, the Brier score is given as the average squared difference 
between the forecaster’s probability of C = 1 and the true label: 

$$BS = \frac{1}{n} \sum_{i=1}^{n} \left(\hat{p}(C = 1 \mid x_i) - c_i\right)^2, \tag{4}$$

where n is the number of samples. Among the various intuitive justifications 
of this score, the following one is based on decision theoretic considerations. 
Assume that the agent (classifier or forecaster), should pay a price proportional 
to the confidence with which it asserts its decision. The Brier score uses the 
probability of the estimate as providing the appropriate penalty. Note that in 
Eq. 4, if the agent predicts $C = 1$ with high probability but $c_i = 0$, the penalty
will be higher than if it predicts $C = 1$ with low probability. Thus, the lower
the Brier score, the lower is the penalty assessed to the agent. 
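
As a small illustration of Eq. 4, the following Python sketch computes the Brier score for a list of forecasts and 0/1 labels. The function name `brier_score` and the toy numbers are illustrative assumptions and are not taken from the experiments in this paper.

```python
def brier_score(forecasts, labels):
    """Average squared difference between the forecast of C = 1 and the 0/1 label (Eq. 4)."""
    if len(forecasts) != len(labels) or not labels:
        raise ValueError("forecasts and labels must be non-empty and of equal length")
    return sum((p - c) ** 2 for p, c in zip(forecasts, labels)) / len(labels)


# A confident wrong forecast is penalized more than a hesitant wrong one.
confident_wrong = brier_score([0.9], [0])
hesitant_wrong = brier_score([0.6], [0])
print(confident_wrong, hesitant_wrong)
```

The comparison mirrors the remark above: asserting $C = 1$ with high probability when the true label is 0 costs more than asserting it with low probability.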

The notion of calibration can be derived directly from the Brier score. We 
need some preliminary definitions. Let $t \in [0, 1]$ denote the a-posteriori probability
assessment of a forecaster. Following [1], we assume that t takes on a
finite number of values on the interval [0, 1]. We denote by $R_t$ the set of feature
values for which the classifier density, $\hat{p}(C = 1 \mid x)$, yields a forecast probability
t, namely:

$$R_t = \{x \in X : \hat{p}(C = 1 \mid x) = t\}. \tag{5}$$

1 Note that the summation over X implies finite values for the features; for continuous
features the summation is replaced by integration. Throughout the paper we maintain
the summation over X, but note that the analysis holds for continuous features
as well.

Let $\pi(t)$ be the probability that the forecaster predicts $C = 1$ with probability
t on a random instance. $\pi(t)$ can also be thought of as the frequency at which
the forecaster predicted $C = 1$ with probability t on a set of N samples, with
$N \to \infty$. As such, given the probability density of the features, $p(x)$, $\pi(t)$ can
be expressed as:

$$\pi(t) = \sum_{x \in R_t} p(x). \tag{6}$$

Let $p(C = 1 \mid t)$ be the probability that $C = 1$ given that the forecaster predicts
$C = 1$ with probability t. The Brier score can be rewritten as (see [1] for
derivation):

$$BS = \sum_{t} \pi(t)\left(t - p(c \mid t)\right)^2 + \sum_{t} \pi(t)\, p(c \mid t)\left(1 - p(c \mid t)\right). \tag{7}$$

The first term is a measure of the calibration, and the second term is a measure
of the refinement of the forecaster, denoted as R. Calibration indicates how close
the probability assessment of the forecaster on $C = 1$ is to the frequency with
which $C = 1$ actually occurs. Note that for calibration to be 0, t has to be
$p(c \mid t)$ for every t. A well-calibrated forecaster is one with calibration equal to 0.
The notion of calibration fits our purposes, since the probability assessments of
a well-calibrated agent can be used in decision making as an indication of its
confidence in the classification label provided.
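
The decomposition in Eq. 7 can be checked numerically. The sketch below groups samples by their forecast value t, estimates $\pi(t)$ and $p(c \mid t)$ by empirical frequencies, and returns the calibration and refinement terms. The function name and the toy data are assumptions made for illustration.

```python
from collections import defaultdict


def brier_decomposition(forecasts, labels):
    """Return (calibration, refinement) of Eq. 7 using empirical frequencies."""
    groups = defaultdict(list)
    for t, c in zip(forecasts, labels):
        groups[t].append(c)          # samples sharing the same forecast t
    n = len(labels)
    calibration = 0.0
    refinement = 0.0
    for t, cs in groups.items():
        pi_t = len(cs) / n           # empirical pi(t)
        p_c_t = sum(cs) / len(cs)    # empirical p(c | t)
        calibration += pi_t * (t - p_c_t) ** 2
        refinement += pi_t * p_c_t * (1 - p_c_t)
    return calibration, refinement


forecasts = [0.2, 0.2, 0.2, 0.2, 0.8, 0.8, 0.8, 0.8]
labels = [0, 0, 0, 1, 1, 1, 0, 1]
calibration, refinement = brier_decomposition(forecasts, labels)
brier = sum((t - c) ** 2 for t, c in zip(forecasts, labels)) / len(labels)
print(calibration, refinement, brier)
```

On this toy set the two terms add up to the Brier score of Eq. 4, with a small calibration term because the forecasts 0.2 and 0.8 are close to the observed frequencies 0.25 and 0.75.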

Refinement scores the usefulness of each forecast. As an illustration, assume 
that we live in a place where it rains 50% of the time. A forecaster that always
announces rain with 50% confidence is then calibrated, yet not very useful in helping
to plan a picnic for the following day. Ideally we would like estimates that are
close to certainty. The more concentrated $p(c \mid t)$ is towards 0 or 1, the more
refined the classifier. To minimize the overall Brier score, a forecaster has to be 
both well-calibrated and refined. Thus, if two classifiers are well-calibrated, the 
one with the lower Brier score is also more refined. We describe the relationship 
between bias, Bayes error, calibration and refinement in the next section. 

3 The Brier Score, Bias, and the Bayes Error 

In the following we show that being well-calibrated is a weaker condition than 
being unbiased. Loosely speaking, a well-calibrated classifier is an “on average” 
unbiased classifier. We also show that we can use the notion of refinement (second 
term in Eq. 7) as a bound on the Bayes error. In particular, twice the refinement 
of a well-calibrated classifier is an upper bound on the Bayes error; and, in 
the case that the classifier is unbiased, then its refinement is a lower bound on 
the Bayes error. Section 5 illustrates the practical implications of the various 
approximations made when calibrating in practice. 

3.1 The Bias/Calibration Relationship 

Being well-calibrated requires that $t = p(c \mid t)$. Using Bayes rule we write $p(c \mid t)$
as $p(c, t) / \pi(t)$, which can be further rewritten as:

$$p(c \mid t) = \frac{p(c, t)}{\pi(t)} = \frac{\sum_{x \in R_t} p(x)\, p(c \mid x)}{\sum_{x \in R_t} p(x)}, \tag{8}$$



where $\pi(t)$ in the denominator is replaced using Eq. 6. The numerator states that
the probability of the joint event that the class variable takes the value c and that
the classifier states this with probability t is the result of summing over these
precise events in feature space (i.e., for $x \in R_t$). Given our assumption regarding
the i.i.d. nature of the samples, this holds. We can now state the following:

Proposition: An unbiased classifier is also well-calibrated. 

Proof. For an unbiased classifier, $\lim_{n \to \infty} \hat{p}(c \mid x) = p(c \mid x)$ for every x, where n
is the number of samples. Therefore, as $n \to \infty$, for every t: $\forall x \in R_t,\ t = p(c \mid x)$.
Replacing $p(c \mid x)$ with t in Eq. 8 yields $p(c \mid t) = t$, which is the condition for
calibration to be 0. □

However, a well-calibrated classifier might not be unbiased. We see from
Eq. 8 that for a well-calibrated classifier, its forecast, $\hat{p}(c \mid x) = t$ for $x \in R_t$,
is a normalized average of the true a-posteriori probability in the region defined
by $R_t$. Clearly, one can construct cases where the classifier is biased, but
$p(c \mid t) = t$ for all t. For example, suppose we have a feature X taking values in {1, 2}, with $p(X = 1) = 0.5$,
$p(c \mid X = 1) = 0.2$ and $p(c \mid X = 2) = 0.6$. Suppose also that the classifier
always predicts c with $\hat{p}(c \mid X) = 0.4$ for any X (hence only $t = 0.4$ has non-zero
probability). Obviously, the classifier is biased. However, from Eq. 8 we have
that $p(c \mid t) = 0.4$ and the classifier is well-calibrated.
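
The example above can be reproduced with a direct evaluation of Eq. 8. In this sketch, the dictionaries and the function name `calibration_map` are illustrative assumptions; the numbers are those of the example.

```python
def calibration_map(p_x, p_c_given_x, forecast):
    """Evaluate Eq. 8: p(c | t) as a p(x)-weighted average of p(c | x) over R_t."""
    numerator = {}
    denominator = {}
    for x, px in p_x.items():
        t = forecast[x]                                   # x belongs to R_t
        numerator[t] = numerator.get(t, 0.0) + px * p_c_given_x[x]
        denominator[t] = denominator.get(t, 0.0) + px     # pi(t), Eq. 6
    return {t: numerator[t] / denominator[t] for t in denominator}


p_x = {1: 0.5, 2: 0.5}            # p(X = x)
p_c_given_x = {1: 0.2, 2: 0.6}    # true a-posteriori probabilities
forecast = {1: 0.4, 2: 0.4}       # the classifier always announces 0.4

cal_map = calibration_map(p_x, p_c_given_x, forecast)
biased = any(abs(forecast[x] - p_c_given_x[x]) > 1e-12 for x in p_x)
print(cal_map, biased)
```

The forecast differs from the true a-posteriori probability at both feature values, yet the calibration map returns 0.4 for t = 0.4, which is the well-calibrated condition.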



3.2 The Bayes Error-Refinement Relationship 

We start by defining a t-dependent error measure:

$$e_t = \sum_{t} \pi(t) \min(t, 1 - t). \tag{9}$$

$e_t$ essentially mirrors the Bayes error formula of Eq. 2, but as we will see, $e_t$ upper
bounds the Bayes error. We are now ready to state the following result:

Theorem 1 Given a well-calibrated classifier, whose forecasts are $\hat{p}(c \mid x)$, and
given the true a-posteriori probability $p(c \mid x)$ with corresponding Bayes error rate
$e_B$, the following holds: $e_B \le e_t \le 2R$.

Proof. Recall that for a well-calibrated classifier, $t = p(c \mid t)$. Making the appropriate
substitution in the second term of Eq. 7, the refinement R can be written
as $R = \sum_{t} \pi(t)\, t(1 - t)$. It is easy to show that for $0 \le t \le 1$,
$\min(t, 1 - t) \le 2\, t(1 - t)$, from which it follows that $e_t \le 2R$. Now we have to show $e_B \le e_t$. We
rewrite the expression for the Bayes error in terms of t and $R_t$:

$$e_B = \sum_{t} \sum_{x \in R_t} p(x) \min\left(p(c \mid x),\, 1 - p(c \mid x)\right). \tag{10}$$

We use Eq. 6 to substitute the $\pi(t)$ term in Eq. 9 and obtain:

$$e_t = \sum_{t} \sum_{x \in R_t} p(x) \min(t, 1 - t). \tag{11}$$

With this reformulation, all we have to show is that for every $R_t$,
$\sum_{x \in R_t} p(x) \min(p(c \mid x), 1 - p(c \mid x)) \le \sum_{x \in R_t} p(x) \min(t, 1 - t)$.
We have two cases, $t \le 0.5$ and $t > 0.5$. We proceed with the proof for the first case. The proof for the
second case is completely analogous. For the case where $t \le 0.5$ we can write:

$$\sum_{x \in R_t} p(x) \min(t, 1 - t) = t \cdot \sum_{x \in R_t} p(x).$$

Using Eq. 8 and the fact that the classifier is well-calibrated we replace t in the
right hand side of the above equation to get:

$$t \sum_{x \in R_t} p(x) = \frac{\sum_{x \in R_t} p(x)\, p(c \mid x)}{\sum_{x \in R_t} p(x)} \sum_{x \in R_t} p(x) = \sum_{x \in R_t} p(x)\, p(c \mid x). \tag{12}$$



In going over all $x \in R_t$, we have two cases, depending on whether $p(c \mid x) \le 0.5$
or $p(c \mid x) > 0.5$ (see footnote 2). Let $x^-$ be such that $p(c \mid x^-) \le 0.5$. Thus we get that
$\min(p(c \mid x^-), 1 - p(c \mid x^-)) = p(c \mid x^-)$. It follows that the corresponding terms of Eq. 10 and Eq. 12
are equal for all such cases. Now let $x' \in R_t$ be such that $p(c \mid x') > 0.5$. For
that $x'$, $\min(p(c \mid x'), 1 - p(c \mid x')) = 1 - p(c \mid x')$. So, while $e_t$ sums over $p(c \mid x')$,
as in Eq. 12, the Bayes error adds the smaller term, $1 - p(c \mid x')$. It follows that
$e_B \le e_t$. □
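
For completeness, the two elementary inequalities on $t \in [0, 1]$ that are invoked in this section without proof can be verified as follows. This is a short worked derivation added for illustration, using only the definitions of $e_t$ (Eq. 9) and of the refinement R for a well-calibrated classifier.

```latex
\begin{align*}
t \le \tfrac{1}{2}: &\quad \min(t, 1-t) = t,
  & 2t(1-t) - t &= t\,(1 - 2t) \ge 0, \\
t > \tfrac{1}{2}: &\quad \min(t, 1-t) = 1 - t,
  & 2t(1-t) - (1-t) &= (1-t)(2t - 1) \ge 0, \\
\text{any } t \in [0,1]: &\quad t(1-t) = \min(t, 1-t)\,\max(t, 1-t)
  & &\le \min(t, 1-t), \quad \text{since } \max(t, 1-t) \le 1 .
\end{align*}
% Multiplying by \pi(t) \ge 0 and summing over t gives
%   R = \sum_t \pi(t)\, t(1-t) \;\le\; e_t = \sum_t \pi(t) \min(t, 1-t) \;\le\; 2R .
```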

From the proof, we see that 'looseness' in the upper bound on the Bayes
error occurs whenever, for certain $x \in R_t$, $p(c \mid x)$ is on the other side of 1/2 with
respect to t. For values of t that are close to 0 or 1, there is less of a chance for such
x's to exist (see Eq. 8), while t close to 1/2 has higher chances of occurrence for
such cases. Therefore, a well-calibrated classifier with $\pi(t)$ that has mass close
to 0 and 1 is not only more refined, but also provides a tighter bound on the
Bayes error.

If the classifier is unbiased, we can provide a stronger result. In this case we
know that asymptotically, $t = p(c \mid x)$ for every $x \in R_t$, and we have that $R \le e_B$.
This follows from the fact that $e_B = e_t$ when the classifier is unbiased, and from
the fact that for $0 \le t \le 1$, the relation $t(1 - t) \le \min(t, 1 - t)$ holds.
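
The bounds of this section can be checked on a small discrete problem. The sketch below computes the Bayes error (Eq. 2), $e_t$ (Eq. 9) and the refinement R (second term of Eq. 7) for a given forecast function. The two-valued feature is the toy example of Section 3.1; the function name `error_quantities` is an assumption made for illustration.

```python
def error_quantities(p_x, p_c_given_x, forecast):
    """Return (Bayes error, e_t, refinement R) for a discrete feature space."""
    pi = {}      # pi(t), Eq. 6
    joint = {}   # p(c, t), numerator of Eq. 8
    for x, px in p_x.items():
        t = forecast[x]
        pi[t] = pi.get(t, 0.0) + px
        joint[t] = joint.get(t, 0.0) + px * p_c_given_x[x]
    bayes = sum(px * min(p_c_given_x[x], 1 - p_c_given_x[x]) for x, px in p_x.items())
    e_t = sum(w * min(t, 1 - t) for t, w in pi.items())
    refinement = sum(joint[t] * (1 - joint[t] / pi[t]) for t in pi)
    return bayes, e_t, refinement


p_x = {1: 0.5, 2: 0.5}
p_c_given_x = {1: 0.2, 2: 0.6}

# Well-calibrated but biased classifier: always forecasts 0.4.
bayes_c, e_t_c, r_c = error_quantities(p_x, p_c_given_x, {1: 0.4, 2: 0.4})
# Unbiased classifier: forecasts the true a-posteriori probability.
bayes_u, e_t_u, r_u = error_quantities(p_x, p_c_given_x, dict(p_c_given_x))
print(bayes_c, e_t_c, 2 * r_c)
print(r_u, bayes_u, e_t_u)
```

For the calibrated forecaster the chain $e_B \le e_t \le 2R$ holds with slack, because $p(c \mid X = 2) = 0.6$ lies on the other side of 1/2 from t = 0.4. For the unbiased forecaster $e_t$ coincides with $e_B$ and R falls below both.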



4 Calibration, Classification Error and ROC Curves 

As discussed in Section 2, in order to minimize the classification error given by
Eq. 3, we need to find the appropriate decision rule. This, in turn, translates to
finding a probabilistic threshold a, so that we classify a sample x as belonging
to class c when $\hat{p}(c \mid x) > a$. In this section we provide a procedure for finding
a in terms of calibration. The intuition is as follows. If we had the real density
$p(C \mid X)$, the optimal decision rule would be given by Eq. 1, which in turn implies that
$a = 0.5$. Now, the process of calibrating may be seen as the process of bringing
$\hat{p}(C \mid X)$ closer to the real density. Calibrating a classifier is a mapping from $\hat{p}(c \mid x)$
to $p(c \mid t)$. In fact the procedures proposed in [4,5] essentially implement this
mapping. Thus, under certain conditions that we outline below, the optimal threshold
$a^*$ of the original classifier is one where in the calibration mapping $p(c \mid a^*) = 0.5$.

2 Recall that these x samples are placed in $R_t$ according to the value of $\hat{p}(c \mid x)$.

To formalize this intuition we need to express the classification error in terms
of the calibration mapping. Suppose that our decision function on t is
such that we say $C = 0$ if $t < a$ and $C = 1$ if $t > a$, where $0 < a < 1$ (note
that for the plug-in decision rule $a = 0.5$). Given the density $\pi(t)$ of t, written $p(t)$ in the integrals below, the
classification error is a function of a and is written as:

$$P_{error}(a) = \int_{0}^{a} p(C = 1 \mid t)\, p(t)\, dt + \int_{a}^{1} \left(1 - p(C = 1 \mid t)\right) p(t)\, dt, \tag{13}$$

where now t takes any value on the interval [0, 1], and is not limited to a discrete
set as in the previous section. The first integral in Eq. 13 is the (weighted) area
under the calibration map, $p(C = 1 \mid t)$, over the region in which we predict class 0; this area
is proportional to the probability that we missed instances that had a label of 1.
The second integral provides the proportion of the error for which we predict 1,
but the actual class label was 0. Borrowing terms from signal detection theory,
the first term is proportional to the probability of missed detection (detection
of class 1), and the second integral is proportional to the probability of false
detection (or false alarm). These areas are illustrated as the marked regions in
Figures 1(a) and (b). We can now state the following:






Fig. 1. Illustration of the calibration map of (a) a well-calibrated classifier (diagonal line) and
(b) a non-calibrated classifier.



Theorem 2 Given a classifier with a-posteriori probabilities t, density $\pi(t)$ and
a calibration map $p(C = 1 \mid t)$, where $p(C = 1 \mid t)$ does not cross 1/2 more than
once, the threshold a on t which achieves minimum probability of error, i.e.,
$a^* = \arg\min_{a} P_{error}(a)$, is given as a s.t. $p(C = 1 \mid t = a) = 0.5$.

Proof. Taking the derivative of $P_{error}(a)$ in Eq. 13 with respect to a yields:

$$\frac{dP_{error}(a)}{da} = p(C = 1 \mid a)\, p(a) - \left(1 - p(C = 1 \mid a)\right) p(a) = \left(2\, p(C = 1 \mid a) - 1\right) p(a). \tag{14}$$

Setting the derivative to 0 yields $p(C = 1 \mid a^*) = 1/2$. □

The reason why the calibration map provides the optimal threshold on t for
minimizing the probability of error is quite simple: the calibration map can be
thought of as a new well-calibrated classifier, with a single feature t; thus the
threshold of 1/2 on this new classifier is optimal. Because we require that the
calibration map does not cross 1/2 more than once, there is a (single) threshold
on our "feature" t that achieves the minimum error.
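
A discrete analogue of Theorem 2 can be sketched in a few lines. Here the forecast t takes five values with equal mass, the calibration map is monotone and crosses 1/2 once, and the error of Eq. 13 is evaluated with sums instead of integrals. The data, the tie convention (predict class 1 when t >= a) and the function names are all assumptions made for illustration.

```python
def error_at(a, ts, pi, cal):
    """Discrete version of Eq. 13: predict class 1 iff t >= a."""
    missed = sum(w * q for t, w, q in zip(ts, pi, cal) if t < a)
    false_alarm = sum(w * (1 - q) for t, w, q in zip(ts, pi, cal) if t >= a)
    return missed + false_alarm


def best_single_threshold(ts, pi, cal):
    """Search candidate thresholds placed between consecutive forecast values."""
    candidates = [0.0] + [(lo + hi) / 2 for lo, hi in zip(ts, ts[1:])] + [1.0]
    return min(candidates, key=lambda a: error_at(a, ts, pi, cal))


ts = [0.1, 0.3, 0.55, 0.7, 0.9]          # forecast values t (sorted)
pi = [0.2, 0.2, 0.2, 0.2, 0.2]           # pi(t)
cal = [0.05, 0.2, 0.35, 0.45, 0.8]       # calibration map p(C = 1 | t)

a_star = best_single_threshold(ts, pi, cal)
plug_in_error = error_at(0.5, ts, pi, cal)
calibrated_error = error_at(a_star, ts, pi, cal)
print(a_star, plug_in_error, calibrated_error)
```

In this toy setting the classifier overestimates the probability of class 1, so the plug-in threshold of 0.5 is too low; the best threshold lies between t = 0.7 and t = 0.9, exactly where the calibration map crosses 1/2.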

The function $p(C = 1 \mid t)$ can also be used to create ROC curves. To see this,
recall that an ROC curve plots the probability of detection,
$P_D = P(\text{Predict } C = 1 \mid \text{Truth is } C = 1)$, against the probability of false alarm,
$P_{FA} = P(\text{Predict } C = 1 \mid \text{Truth is } C = 0)$, created by varying a threshold (e.g., likelihood ratio). The
threshold is varied so that we move from perfect detection, but maximum false
alarm, to no false alarms, but minimum detection. We already stated that the
two integrals composing $P_{error}$ in Eq. 13 are directly related to $P_D$ and $P_{FA}$;
to put it more accurately:

$$P_D(a) = 1 - \frac{\int_{0}^{a} p(C = 1 \mid t)\, p(t)\, dt}{p(C = 1)}, \qquad P_{FA}(a) = \frac{\int_{a}^{1} \left(1 - p(C = 1 \mid t)\right) p(t)\, dt}{1 - p(C = 1)}. \tag{15}$$

Thus, by varying the threshold a, we can generate the entire ROC curve using the
calibration map. At this point it is clear that methods that find the threshold
of minimum error from ROC curves [6, 7] produce exactly the same result as the
calibration procedure, when the calibration map does not cross 1/2 more than
once.
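
The relation in Eq. 15 can be illustrated with a discrete calibration map. The sketch below sweeps a threshold over the forecast values, computes $(P_{FA}, P_D)$ for each, and recovers the classification error from an ROC point via $P_{error} = p(C = 1)(1 - P_D) + (1 - p(C = 1)) P_{FA}$. The toy data and the function names are illustrative assumptions.

```python
def roc_from_calibration_map(ts, pi, cal):
    """Return ROC points (threshold, P_FA, P_D) and the prior p(C = 1).

    A sample is assigned to class 1 iff its forecast t >= threshold.
    """
    prior = sum(w * q for w, q in zip(pi, cal))
    points = []
    for a in list(ts) + [float("inf")]:
        detected = sum(w * q for t, w, q in zip(ts, pi, cal) if t >= a)
        false_alarm = sum(w * (1 - q) for t, w, q in zip(ts, pi, cal) if t >= a)
        points.append((a, false_alarm / (1 - prior), detected / prior))
    return points, prior


def error_from_roc(p_fa, p_d, prior):
    """Classification error implied by a single ROC point."""
    return prior * (1 - p_d) + (1 - prior) * p_fa


ts = [0.1, 0.3, 0.55, 0.7, 0.9]
pi = [0.2, 0.2, 0.2, 0.2, 0.2]
cal = [0.05, 0.2, 0.35, 0.45, 0.8]

points, prior = roc_from_calibration_map(ts, pi, cal)
best = min(points, key=lambda p: error_from_roc(p[1], p[2], prior))
print(best, error_from_roc(best[1], best[2], prior))
```

The minimum-error ROC point corresponds to the threshold t >= 0.9, the first forecast value at which the calibration map exceeds 1/2, in agreement with the single-crossing case of Theorem 2.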

However, the calibration procedure generalizes beyond what can be achieved
with the ROC method. Theorem 2 can be extended to the case where the calibration
map crosses 1/2 more than once, requiring multiple thresholds on the
original decision function for minimizing the error: given multiple thresholds on
the decision function we can rewrite Eq. 13 (splitting the integral based on
the number of needed thresholds) and find that the minimum probability of
error for any number of thresholds still occurs where the calibration map equals 1/2.
Such cases could occur with classifiers that output a-posteriori probabilities that
are ranked incorrectly. For example, suppose that one class is split into several
clusters in space, and a classifier (for example, a linear one) separates some
clusters well, leaving other clusters far from the decision boundary. The resultant calibration
map of such classifiers would cross 0.5 at several places, but the point
of minimum error is still at $p(C = 1 \mid t) = 0.5$. Thus, inverting the calibration
map at that point provides several thresholds on the decision rule. We
illustrate the above with a two dimensional example, shown in Figure 2(a). The
class marked with circles (class "1") consists of two clusters which are divided by
the class marked with x's (class "0"). Learning a logistic regression classifier on
the data leads to a single linear boundary (shown in the figure), which does well
at separating one cluster, but leaves the second one very far from the boundary.
Thus, data from that cluster have a higher estimated probability of belonging to class "0"
than data from class "0" itself. Figure 2(b) shows the calibration map of the
logistic regression classifier. The map crosses 0.5 at two values, thus leading to
two decision boundaries (with the same slope as the original, but two different
intercepts). With these boundaries, both clusters of class "1" are well separated,
and the resultant error is significantly lower, dropping from 10% with the original
boundary to 5.5% with the new boundaries.
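
The multiple-threshold case can be mimicked with a discrete calibration map that crosses 1/2 twice. In the sketch below the lowest forecast value is dominated by class 1 (a stand-in for the cluster left far from the boundary), and the rule "predict class 1 wherever the calibration map exceeds 1/2" is compared with the best single threshold. The numbers are invented for illustration and are not the data behind Figure 2.

```python
def error_of_rule(predict_one, pi, cal):
    """Error of a decision rule given as one boolean per forecast value."""
    return sum(w * ((1 - q) if one else q) for one, w, q in zip(predict_one, pi, cal))


ts = [0.1, 0.3, 0.5, 0.7, 0.9]
pi = [0.2, 0.2, 0.2, 0.2, 0.2]
cal = [0.9, 0.1, 0.2, 0.6, 0.9]      # calibration map crossing 1/2 twice

# Rule derived from the calibration map: class 1 wherever p(C = 1 | t) > 1/2.
calibrated_rule = [q > 0.5 for q in cal]
multi_threshold_error = error_of_rule(calibrated_rule, pi, cal)

# Best rule of the form "class 1 iff t >= a" (what a single ROC threshold offers).
single_errors = [error_of_rule([t >= a for t in ts], pi, cal)
                 for a in ts + [float("inf")]]
best_single_error = min(single_errors)

# Number of thresholds on t implied by the calibrated rule.
n_thresholds = sum(1 for u, v in zip(calibrated_rule, calibrated_rule[1:]) if u != v)
print(multi_threshold_error, best_single_error, n_thresholds)
```

The calibrated rule uses two thresholds on t and attains a lower error than any single threshold, which is the qualitative effect described above.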




Fig. 2. Example of calibration finding multiple thresholds on the decision rule. (a) Decision
boundaries before and after calibration superimposed on data. (b) The calibration
map of the original linear classifier.



The results above can also be extended from the 0-1 loss to a general loss
function, for which $c_{01}$ is the cost of predicting class 0 when the true class is 1
and $c_{10}$ is the cost of predicting class 1 when the true class is 0. The Bayes decision
rule minimizing the risk under this loss function calls for classifying a sample x as
1 if $p(C = 1 \mid x) > \frac{c_{10}}{c_{10} + c_{01}}$ [10]. As with the classification error under the 0-1 loss,
applying the threshold $\frac{c_{10}}{c_{10} + c_{01}}$ to the estimated classifier, $\hat{p}(C = 1 \mid x)$, may not
minimize the risk under the generalized loss. However, using the same arguments
given in Theorem 2, the thresholds which minimize the generalized loss
function for a given classifier are the values of t for which $p(C = 1 \mid t) = \frac{c_{10}}{c_{10} + c_{01}}$.
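
The cost-sensitive threshold follows from comparing the expected cost of the two possible decisions: predicting 1 costs $(1 - q)\, c_{10}$ and predicting 0 costs $q\, c_{01}$, where q is the calibrated probability of class 1. The sketch below encodes this comparison; the function names and the example costs are assumptions made for illustration.

```python
def cost_threshold(c01, c10):
    """Threshold on the calibrated probability of class 1 under costs c01, c10."""
    return c10 / (c10 + c01)


def expected_cost(prediction, q, c01, c10):
    """Expected cost of a prediction when the calibrated P(C = 1) is q."""
    return (1 - q) * c10 if prediction == 1 else q * c01


def decide(q, c01, c10):
    """Predict class 1 iff q exceeds the cost-sensitive threshold."""
    return 1 if q > cost_threshold(c01, c10) else 0


# Missing a class-1 sample is assumed four times as costly as a false alarm.
c01, c10 = 4.0, 1.0
decisions = {q: decide(q, c01, c10) for q in (0.05, 0.15, 0.25, 0.5, 0.9)}
print(cost_threshold(c01, c10), decisions)
```

With equal costs the threshold reduces to 1/2, recovering the 0-1 loss case of Theorem 2.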



5 Calibration with Finite Data 

With finite data sets, we want to estimate $p(C = 1 \mid \hat{p}(C = 1 \mid x))$ reliably. A procedure
for this estimation was provided in [4, 5], where $\hat{p}(C = 1 \mid x)$ is binned on
the interval [0, 1] and the calibration map is estimated by counting the number
of samples that fall into each bin. The procedure was originally suggested as a
method for calibrating Naive Bayes classifiers, but is applicable to any classifier
that outputs probabilities, or a distance measure that can be converted to
probabilities (e.g., Tree-augmented Naive Bayes [11], logistic regression, mixture
models, and SVMs). The empirical success of calibration on various (typically
large) data sets has been shown in previous works; in this section we aim
at providing insight into the finite sample effects that can arise with calibration.
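
A minimal sketch of a binning-based calibration map is given below. It assumes equal-width bins on [0, 1] and estimates each bin's value as the fraction of class-1 samples that fall into it; the exact binning scheme of [4, 5] may differ, and the fallback to the bin midpoint for empty bins is our own assumption.

```python
def fit_binned_calibration(scores, labels, n_bins=20):
    """Estimate p(C = 1 | bin) for equal-width bins over the classifier score."""
    counts = [0] * n_bins
    positives = [0] * n_bins
    for s, c in zip(scores, labels):
        b = min(int(s * n_bins), n_bins - 1)   # a score of 1.0 goes to the last bin
        counts[b] += 1
        positives[b] += c
    # Assumption: an empty bin falls back to its midpoint (identity map).
    return [positives[b] / counts[b] if counts[b] else (b + 0.5) / n_bins
            for b in range(n_bins)]


def calibrate(score, cal_map):
    """Map a raw classifier score to its calibrated probability."""
    n_bins = len(cal_map)
    return cal_map[min(int(score * n_bins), n_bins - 1)]


scores = [0.05, 0.05, 0.95, 0.95, 0.95, 0.95]
labels = [0, 1, 1, 1, 1, 0]
cal_map = fit_binned_calibration(scores, labels, n_bins=2)
print(cal_map, calibrate(0.9, cal_map))
```

In this toy example the overconfident scores near 0.05 and 0.95 are pulled towards the observed class frequencies of their bins. With few samples per bin these frequencies become noisy, which is the finite sample effect studied in this section.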

Estimating the calibration map involves learning a function from a scalar
input, $\hat{p}(C = 1 \mid x)$, to a scalar output. Thus, it is insensitive to the number of
features in the classifier. The estimation is sensitive, though, to the sample size
and to the number of bins used in the estimation procedure. To evaluate the
effect of the sample size on the calibration procedure, we use learning curves
showing the various performance metrics before and after calibration.

We use the calibration procedure for prediction of I/O response time of
individual requests to an enterprise storage array. Our data is based on an
anonymized month-long trace of requests to a Hewlett-Packard XP 512 storage
array collected by the Storage Systems group at Hewlett-Packard Laboratories
between 27 September and 27 October 2002. The raw traces are transformed to
10 features that describe queue lengths, locality and sequentiality, as measured
by the server issuing the I/O request to the storage array. The problem is transformed
to a binary classification problem by determining that any response time
less than or equal to 1.5 msec is considered fast, while any response time greater
than 1.5 msec is considered slow.

The data consists of 686091 training samples and 343046 test samples. We build
two competing models to predict the correct class for the I/O request. The first
is the Naive Bayes classifier, with Gaussian conditional distributions for the numerical
features and a multinomial distribution for a locality feature. The second
is a mixture of regression (MoR) classifier. The MoR model finds a mixture of
regressors between the features and response time, which provides a distribution
of response time for each value of the features, from which we can compute the
a-posteriori probability of the response time being fast or slow. With the full
training data, the Naive Bayes model achieves 82.18% accuracy before calibration
and 85.60% after calibration, a significant improvement. The MoR model
improves from 85.50% to 86.16%, a more modest improvement, as is to be expected
from a model that is more naturally calibrated than Naive Bayes. The
learning curves, both of accuracy and the Brier score, are shown in Figure 3.

For generating the learning curves we fix the number of bins used in the calibration
procedure at 20, and average the results measured on the test set over 5
trials for each point on the curve. We see that for the Naive Bayes classifier, calibration
improves accuracy and the Brier score early on the curve (already at 200
training samples), while for the already almost calibrated MoR, the calibration
procedure does not produce a significant benefit to performance until fairly large
training sets are available. Observing the changes in the Brier score, it appears
that both models achieve near convergence to a calibrated classifier as early as
after 1000 samples. It is also important to note that for the MoR and sample
sizes smaller than 400, the calibration procedure slightly degrades performance
because of overfitting. These experiments illustrate that models that are far from
being calibrated benefit from calibration even with little data; for classification,
any change in the decision boundary in the right direction has a large effect.
However, models that are close to being calibrated are more sensitive to noise
in the calibration map, and are more prone to overfitting with small data sets.
We discuss possible ways to overcome these effects in the summary.




Fig. 3. Learning curves of Naive Bayes and MoR for the I/O prediction data: (a) classification accuracy, (b) Brier score.



6 Summary 

In this paper we characterize the mathematical relation between calibration and 
bounds on the Bayes error and the use of calibration to find thresholds in the 
decision rule minimizing a classifier’s error. These theoretical results, coupled 
with mounting empirical evidence in the literature, illustrate the importance 
and value of calibrating classifiers for classification and decision making. 

The result relating calibration to the decision rule that minimizes the classification
error produces an effective procedure for finding optimal thresholds in
this decision rule. This also establishes a direct relationship with ROC curves,
a relationship which was informally alluded to in [6], and is formalized in this
paper. As with any learning algorithm, finite sample effects have to be considered;
our learning curve experiments show that a simple calibration procedure
performs well with large training sets, but can cause overfitting with small training
sets. The possibility of overfitting can be reduced by smoothing the
calibration map or by estimating a smooth function (such as the sigmoid) as the
calibration map [5]. As a note, the number of thresholds on the decision function
would depend on the smoothing function used; e.g., with a sigmoid, only one
threshold can be found, which might not always be desirable. We also observe
that calibration is more beneficial, even at small sample sizes, for classifiers that
are inherently not calibrated (such as Naive Bayes), compared to calibration of
classifiers that are more naturally calibrated (such as logistic regression).

Future work includes providing bounds on how the estimation error of the 
calibration map affects the estimation of the optimal thresholds for classifica- 
tion and the payoff in terms of decision making. Such bounds could help avoid 
overfitting, especially with small sample sizes. Extending the method beyond
binary classification problems is another research direction, similar to methods
extending ROC curves beyond binary classification [7]. We are also exploring the
use of calibration in semi-supervised learning, to help eliminate the possibility of
performance degradation when using unlabeled data to learn classifiers, a phenomenon
that occurs with biased models that output uncalibrated a-posteriori
probabilities [12].

Abstract. We propose a novel methodology for clustering XML documents
on the basis of their structural similarities. The idea is to equip
each cluster with an XML cluster representative, i.e. an XML document
subsuming the most typical structural specifics of a set of XML documents.
Clustering is essentially accomplished by comparing cluster representatives,
and updating the representatives as soon as new clusters are
detected. We present an algorithm for the computation of an XML representative
based on suitable techniques for identifying significant node
matchings and for reliably merging and pruning XML trees. Experimental
evaluation performed on both synthetic and real data shows the effectiveness
of our approach.



1 Introduction 

As the heterogeneity of XML sources increases, the need for organizing XML
documents according to their structural features has become pressing. In such
a context, we address the problem of inferring structural similarities among XML
documents, with the adoption of clustering techniques. This problem has several
interesting applications related to the management of Web data. For example,
structural analysis of Web sites can benefit from the identification of similar
documents, conforming to a particular schema, which can serve as the input for
wrappers working on structurally similar Web pages. Also, query processing in
semistructured data can take advantage of the re-organization of documents
on the basis of their structure. Grouping semistructured documents according to
their structural homogeneity can help in devising indexing techniques for such
documents, thus improving the construction of query plans.

The problem of comparing semistructured documents has been recently investigated
from different perspectives [5,14,4,3,8]. Recent studies have also proposed
techniques for clustering XML documents. The work in [7] describes a partitioning
method that clusters documents, represented in a vector-space model, according
to both textual contents and structural relations among tags. The approach
in [13] proposes to measure structural similarity by means of an XML-aware
edit distance, and applies a standard hierarchical clustering algorithm to evaluate
how closely cluster documents correspond to their respective DTDs.

In our opinion, the main drawback of the above approaches is the lack of 
a notion of cluster prototype, i.e. a summarization of the relevant features of
the documents belonging to a cluster. The notion of cluster prototype is crucial 
in most significant application domains, such as wrapper induction, similarity 
search, and query optimization. Indeed, in the context of wrapper induction, the 
efficiency and effectiveness of the extraction techniques strongly rely on the ca- 
pability of rapidly detecting homogeneous subparts of the documents under con- 
sideration. Similarity search can substantially benefit from narrowing the search 
space. In particular, this can be achieved by discarding clusters whose proto- 
types exhibit features which do not comply with the target properties specified 
by a user-supplied query. 

To the best of our knowledge, the only approach devising a notion of cluster
prototype is [11]. There, the authors propose to compare documents according
to a structure graph, or s-graph, summarizing the relations between elements
within documents. Since the notion of s-graph can be easily generalized to sets
of documents, the comparison of a document with a cluster can be
easily accomplished by means of their corresponding s-graphs. However, a main
problem with this approach lies in the coarse-grained similarity it yields.
Indeed, two documents can share the same prototype s-graph, and still
have significant structural differences, such as in the hierarchical relationship
between elements. It is clear that the approach falls short in application
domains, such as wrapper generation, that require detecting finer structural dissimilarities.

In this paper we propose a novel methodology for clustering XML documents 
by structure, which is based on the notion of XML cluster representative. A clus- 
ter representative is a prototype XML document subsuming the most relevant 
structural features of the documents within a cluster. The intuition at the core 
of our approach is that a suitable cluster prototype can be obtained as the out- 
come of a proper overlapping among all the documents within a given cluster. 
Actually, the resulting tree has the main advantage of retaining the specifics 
of the enclosed documents, while guaranteeing a compact representation. This 
eventually makes the proposed notion of cluster representative extremely prof- 
itable in the envisaged applications: in particular, as a summary for the cluster, 
a representative highlights common subparts in the enclosed documents, and can 
avoid expensive comparisons with the individual documents in the cluster. 

The proposed notion of cluster representative relies on the notions of XML 
tree matching and merging. Specifically, given a set of XML documents, our 
approach initially builds an optimal matching tree, i.e. an XML tree that is 
built from the structural resemblances that characterize the original documents. 
Then, in order to capture all such peculiarities within a cluster, a further tree, 
called a merge tree, is built to include those document substructures that are not 
recurring across the cluster documents. Both trees are exploited for suitably
computing a cluster representative, as will be detailed later. Finally, a hierarchical
clustering algorithm exploits the devised notion of representative to partition
XML documents into structurally homogeneous groups. Experimental evaluation
performed on both synthetic and real data confirms the effectiveness of our
approach in identifying document partitions characterized by a high degree of
homogeneity.

Input: A set S = {t_1, ..., t_n} of XML document trees;
Output: A cluster partition P = {C_1, ..., C_k} of S.
Method:
    let P := {C_1, ..., C_n}, where initially C_i = {t_i};
    set r_i := t_i as the representative for C_i;
    compute a tree-distance matrix M_d, where M_d(i, j) = d(t_i, t_j);
    repeat
        choose clusters C_i and C_j such that d(rep(C_i), rep(C_j)) is minimized;
        compute the representative r = rep(r_i, r_j) for cluster C = C_i ∪ C_j;
        set P := P - {C_i, C_j} ∪ {C}, and update M_d;
    until P has maximal quality;

Fig. 1. The XRep algorithm for clustering XML documents.


2 Problem Statement 

Clustering is the task of organizing a collection of objects (whose classification is 
unknown) into meaningful or useful groups, namely clusters , based on the inter- 
esting relationships discovered in the data. The goal is grouping highly-similar 
objects into individual partitions, with the requirement that objects within dis- 
tinct clusters are dissimilar from one another. 

Several clustering algorithms [10] can be suitably adapted for clustering 
semistructured data. We concentrate on hierarchical approaches, which are 
widely known to provide clusters of better quality, and can be exploited to
generate cluster hierarchies. Fig. 1 shows XRep, an adaptation of the agglomerative
hierarchical algorithm to our problem. Each XML tree (derived by parsing
the corresponding XML document) is initially placed in its own cluster, and a
pair-wise tree distance matrix is computed. The algorithm then enters an
iterative step in which the least dissimilar clusters are merged. As a consequence,
the distance matrix is updated to reflect this merge operation. The overall pro- 
cess is stopped when an optimal partition (i.e. a partition whose intra-distance 
within clusters is minimized and inter-distance between clusters is maximized) 
is reached. In this paper, we follow the approach devised in [9], and address the 
problem of clustering XML documents in a parametric way. More precisely, the 
general scheme of the XRep algorithm is parametric to the notions of distance 
measure and cluster representative. 
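
The parametric scheme described above can be sketched in a few lines of Python.
This is only an illustrative skeleton, not the authors' implementation: the
function name xrep, the callables distance and merge_rep, and the stopping rule
(a target number k of clusters, used here in place of the "maximal quality"
criterion) are assumptions introduced for the example. The toy usage treats
documents as tag sets, with intersection as a stand-in for the representative.

```python
from itertools import combinations

def xrep(trees, distance, merge_rep, k):
    """Agglomerative clustering driven by cluster representatives.

    trees     : list of tree objects accepted by `distance`
    distance  : function (tree, tree) -> float
    merge_rep : function (rep_a, rep_b) -> representative of the merged cluster
    k         : target number of clusters (assumed stopping rule)
    """
    clusters = [[i] for i in range(len(trees))]  # each tree starts alone
    reps = list(trees)                           # r_i := t_i
    while len(clusters) > k:
        # pick the pair of clusters whose representatives are closest
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda p: distance(reps[p[0]], reps[p[1]]))
        merged = clusters[a] + clusters[b]
        new_rep = merge_rep(reps[a], reps[b])
        # a < b, so deleting position b first keeps index a valid
        for seq, new in ((clusters, merged), (reps, new_rep)):
            del seq[b]
            seq[a] = new
    return clusters, reps

def jaccard(a, b):
    return 1 - len(a & b) / len(a | b)

# Toy usage: documents reduced to their tag sets.
docs = [{"a", "b"}, {"a", "b", "c"}, {"x", "y"}, {"x", "y", "z"}]
clusters, reps = xrep(docs, jaccard, lambda r, s: r & s, k=2)
print(clusters)
```

Only representatives are compared at each step, which is the point of the
scheme: the cost of a merge does not depend on how many documents a cluster holds.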

Concerning the distance measure, we choose to adapt the Jaccard coefficient [10]
to the context of XML trees. A first measure can be straightforwardly
defined by considering the feature space representing the set of labels (i.e. tag
names) associated with the nodes in a tree: if we denote with $tag(t)$ the set of
tag names for a tree $t$, then we define as

$$d_J^{(1)}(t_1, t_2) = 1 - \frac{|tag(t_1) \cap tag(t_2)|}{|tag(t_1) \cup tag(t_2)|}$$

the Jaccard distance between two trees $t_1$ and $t_2$. An alternative (and more refined)
definition is given by taking into account the paths in the trees rather than only
the node labels. More precisely,

$$d_J^{(2)}(t_1, t_2) = 1 - \frac{|path(t_1) \cap path(t_2)|}{\max\{|path(t_1)|, |path(t_2)|\}}$$

where $path(t_i)$ denotes the set of paths in $t_i$, and $path(t_1) \cap path(t_2)$ is the
set of common paths between $t_1$ and $t_2$.
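
As an illustration of the two distances, the following sketch computes them with
Python's standard xml.etree.ElementTree module. It rests on assumptions that the
text does not spell out: a path is taken to be the sequence of tags from the root
to a node, the tag-based distance is normalized by the union of the tag sets, and
the path-based distance by the larger of the two path sets. The helper names
(tags, paths, d_tags, d_paths) are hypothetical.

```python
import xml.etree.ElementTree as ET

def tags(root):
    """Set of tag names occurring in the tree."""
    return {el.tag for el in root.iter()}

def paths(root):
    """Set of root-to-node label paths, each encoded as a tuple of tags."""
    found = set()
    def walk(el, prefix):
        path = prefix + (el.tag,)
        found.add(path)
        for child in el:
            walk(child, path)
    walk(root, ())
    return found

def d_tags(t1, t2):
    a, b = tags(t1), tags(t2)
    return 1 - len(a & b) / len(a | b)

def d_paths(t1, t2):
    a, b = paths(t1), paths(t2)
    return 1 - len(a & b) / max(len(a), len(b))

# Same vocabulary except for one tag, but <name> sits at a different depth.
t1 = ET.fromstring("<book><title/><author><name/></author></book>")
t2 = ET.fromstring("<book><title/><name/></book>")
print(d_tags(t1, t2), d_paths(t1, t2))
```

In this example the path-based distance is twice the tag-based one, because the
element name is shared as a label but not as a path.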

Intuitively, the representative of a cluster of XML documents is a document 
which effectively synthesizes the most relevant structural features of the docu- 
ments in the cluster. The notion of representative can be formalized as follows. 

Definition 1. Given a set $U$, equipped with a distance function $d : U \times U \to \mathbb{R}$,
and a set $S = \{t_1, \ldots, t_n\} \subseteq U$ of XML document trees, the representative of $S$
(denoted by $rep(S)$) is the tree $t^*$ that minimizes the sum of the distances:

$$t^* = rep(S) \in U \iff t^* = \arg\min_{t \in U} f(t), \quad \text{where } f(t) = \sum_{i=1}^{n} d(t_i, t). \qquad \Box$$

The computation of the representative of a set turns out to be a hard prob- 
lem if the above distance measures are adopted. Therefore we exploit a suitable 
heuristic for addressing the above minimization problem. Viewed in this respect, 
our goal is to find a lower-bound-tree and an upper-bound-tree for the optimal 
representative. The lower-bound-tree (resp. upper-bound-tree) is a tree on which 
any node deletion (resp. node insertion) leads to a worsening in function $f$. Thus,
a representative can be heuristically computed by traversing the search space 
delimited by the above trees. Two alternative greedy strategies can be devised: 
either a growing approach, which iteratively adds nodes to the lower-bound, or 
a pruning approach, which iteratively removes nodes from the upper-bound. In 
the following, we will denote the lower-bound-tree and the upper-bound-tree 
as optimal matching tree and merge tree, respectively. Notice that the optimal 
matching tree represents a stopping condition for the pruning approach, whereas 
the merge tree is always a sub-optimal solution since it contains the optimal rep- 
resentative. Dually, the merge tree defines a stopping condition for the growing 
approach, whereas the optimal matching tree is a sub-optimal solution since it 
is contained in the optimal representative. 

We develop a pruning approach in which the computation of an XML cluster 
representative consists of the following three main stages: the construction of an 
optimal matching tree, the computation of a merge tree, and the pruning of the 
merge tree. Fig. 3 sketches an algorithm which has been developed according to 
the above three stages. 

3 Mining Representatives from XML Trees 

We now give some definitions which are at the basis of our approach. A tree $t$ is
a tuple $t = (r_t, V_t, E_t, \lambda_t)$ where $V_t \subseteq \mathbb{N}$ is the set of nodes,
$E_t \subseteq V_t \times V_t$ is the set of edges, $r_t$ is the root node of $t$, and
$\lambda_t : V_t \to \Sigma$ is a node labelling function where $\Sigma$ is an alphabet of node labels.
In particular, we say that an XML tree is a tree where $\Sigma$ is an alphabet of element tags.
Moreover, let $depth_t(v)$ denote the depth level of node $v$ in $t$, with $depth_t(r_t) = 0$,
and let $path_t(v) = \langle v_{i_1} = r_t, v_{i_2}, \ldots, v_{i_p} = v \rangle$ denote the list of $p$ nodes
that lead up to the node $v$ from the root $r_t$.
Fig. 2. (a) Strong and (b) multiple matching nodes, and (c) their trees.



Definition 2 (strong matching). Given two trees $t_1$ and $t_2$, and two nodes
$v \in V_{t_1}$, $w \in V_{t_2}$, a strong matching $match(v, w)$ between $v$ and $w$ exists if
$\lambda_{t_1}(v_i) = \lambda_{t_2}(w_i)$ and $depth_{t_1}(v_i) = depth_{t_2}(w_i)$, for each pair of nodes
$(v_i, w_i)$ such that $v_i \in path_{t_1}(v)$ and $w_i \in path_{t_2}(w)$. □

The above definition states that two nodes, v and w, have a strong matching 
if v and w together with their respective ancestors share both the same label 
(i.e. tag name) and depth level. Fig. 2(a) displays an example of strong matching 
among the colored nodes. 
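
A minimal sketch of the strong-matching test follows. The Node class and the
function names are hypothetical; the sketch relies on the observation that two
nodes whose root-to-node label sequences coincide necessarily agree on both label
and depth at every level.

```python
class Node:
    """Tree node that only knows its label and its parent."""
    def __init__(self, label, parent=None):
        self.label = label
        self.parent = parent

def label_path(node):
    """Labels from the root down to `node`."""
    labels = []
    while node is not None:
        labels.append(node.label)
        node = node.parent
    return labels[::-1]

def strong_match(v, w):
    # Equal label paths imply equal depth and equal labels at every level.
    return label_path(v) == label_path(w)

# First tree: a/b/c.  Second tree: a/b/c and a/d/c.
root1 = Node("a"); b1 = Node("b", root1); c1 = Node("c", b1)
root2 = Node("a"); b2 = Node("b", root2); c2 = Node("c", b2)
d2 = Node("d", root2); c3 = Node("c", d2)

print(strong_match(c1, c2))  # same labels along the whole path
print(strong_match(c1, c3))  # same label and depth, but a different ancestor
```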

The detection of matching nodes between two trees allows the construction 
of a new tree, called a matching tree, which resembles the intersection of the 
original trees. 

Definition 3 (matching tree). Given two trees $t_1$ and $t_2$, a tree $t = (r_m, V_m, E_m, \lambda_m)$
is a matching tree, denoted by $t = match(t_1, t_2)$, if the following conditions hold:

1. there exist two mappings $f_1 : t \to t_1$ and $f_2 : t \to t_2$ associating nodes and
edges in $t$ with a subtree in $t_1$ and $t_2$;

2. for each $u \in V_m$, there exists a strong matching between $v = f_1(u)$ and
$w = f_2(u)$ (i.e. $match(v, w)$ holds); moreover, $\lambda_m(u) = \lambda_{t_1}(v) = \lambda_{t_2}(w)$;

3. $f_1(r_m) = r_{t_1}$, and $f_2(r_m) = r_{t_2}$; moreover, for each $e = (u, v) \in E_m$,
$f_1(e) = (f_1(u), f_1(v))$ and $f_2(e) = (f_2(u), f_2(v))$. □

Notice that, in general, multiple matchings may occur when a node in a tree
has a matching with more than one node in a different tree. More formally, given
two trees $t_1$ and $t_2$, a node $v \in V_{t_1}$ has a multiple matching if $\exists w', w'' \in V_{t_2}$
such that both $match(v, w')$ and $match(v, w'')$ hold. An example of multiple
matching between nodes in two trees is shown in Fig. 2(b). Multiple matchings
trigger ambiguities in defining matching trees: Fig. 2(c) represents two alternative
matching trees for the documents in Fig. 2(b).

3.1 XML Tree Matching 

In order to capture as many structural affinities as possible, we are interested
in finding matching trees with maximal size. Formally, a matching tree
$t_m = match(t_1, t_2)$ is an optimal matching tree for two trees $t_1$ and $t_2$ if there
does not exist another matching tree $t'_m = match(t_1, t_2) \neq t_m$ such that $|V_{t'_m}| > |V_{t_m}|$.
We describe a dynamic-programming technique for building an optimal matching
tree from two XML trees. The technique consists of three steps: i) detection of
matching nodes, ii) selection of best matchings, and iii) optimal matching tree
construction.

Input:
    An XML tree r_1 = (r_{r_1}, V_{r_1}, E_{r_1}, λ_{r_1}) as representative of cluster C_1, and
    an XML tree r_2 = (r_{r_2}, V_{r_2}, E_{r_2}, λ_{r_2}) as representative of cluster C_2.
Output:
    An XML tree rep as representative of cluster C = C_1 ∪ C_2.
Method:
    compute the matching matrix M_m, with size (|V_{r_1}| × |V_{r_2}|);
    compute the marking vectors V_{m1}, V_{m2}, where V_{m1}.size = |V_{r_1}| and V_{m2}.size = |V_{r_2}|;
    set m_1 := |{v_i ∈ V_{r_1} | V_{m1}[i] ≠ -1}|, and m_2 := |{v_i ∈ V_{r_2} | V_{m2}[i] ≠ -1}|;
    if (m_1 > m_2)
        match := buildMatch(r_1, r_2, V_{m1}, V_{m2}); merge := buildMerge(r_1, r_2, V_{m1}, V_{m2});
    else
        match := buildMatch(r_2, r_1, V_{m2}, V_{m1}); merge := buildMerge(r_2, r_1, V_{m2}, V_{m1});
    rep := prune(C_1 ∪ C_2, merge, match);
    return rep;

Function buildMatch(t_1, t_2, V_{m1}, V_{m2}) : t;
    t := t_1;
    for each v_i ∈ V_{t_1}, V_{m1}[i] = -1 do
        remove(t, v_i); /* removes the subtree rooted at v_i from t */
    let I_j = {v_{i_1}, ..., v_{i_h} ∈ V_{t_1} | V_{m1}[i_k] = j, ∀k ∈ [1..h]};
    for each I_j do
        removeDuplicates(t, I_j); /* removes duplicated paths from t */
    return t;

Function buildMerge(t_1, t_2, V_{m1}, V_{m2}) : t;
    t := t_1;
    for each v_i ∈ V_{t_1} do
        let J = {w_{j_1}, ..., w_{j_h} ∈ V_{t_2} | V_{m2}[j_p] = i, p ∈ [1..h]};
        let v ∈ V_{t_1} such that (v, v_i) ∈ E_{t_1};
        insert(t, v, v_i, |J| - 1); /* inserts node v_i as a child of v into t, |J| - 1 times */
    for each w_i ∈ V_{t_2}, V_{m2}[i] = -1 do
        let w_j ∈ V_{t_2} such that (w_j, w_i) ∈ E_{t_2}, and v_h ∈ V_{t_1} such that V_{m2}[j] = h;
        insert(t, v_h, w_i); /* inserts node w_i as a child of v_h into t */
    return t;

Function prune(C, t, t') : r;
    set r := t;
    do
        let L ⊆ V_r be the set of leaf nodes in r;
        compute d_0 := Σ_{t ∈ C} d(t, r);
        for each v_l ∈ L do
            compute r^(l) := removeLeaf(r, v_l);
        l* = argmin_l Σ_{t ∈ C} d(t, r^(l));
        set d* := Σ_{t ∈ C} d(t, r^(l*));
        if (d* < d_0)
            r := r^(l*);
    while d* < d_0 and V_r ⊃ V_{t'};
    return r;

Fig. 3. The algorithm for the computation of an XML cluster representative.


Matching detection. Given two trees $t_1$ and $t_2$, the detection of matching nodes
is performed by building a $(|V_{t_1}| \times |V_{t_2}|)$ matching matrix $M_m$. In this matrix,
the generic $(i, j)$-th element corresponds to nodes $v_i \in V_{t_1}$ and $w_j \in V_{t_2}$, and
contains a weight $\omega_m(v_i, w_j)$ to be associated with the matching between $v_i$ and
$w_j$. Initially, the weight is 1 if $match(v_i, w_j)$ holds, and 0 otherwise. In order to
ease the construction of the matching matrix, nodes are enumerated by level,
thus guaranteeing a particular block structure for $M_m$. Indeed, for each level
$k$, a sub-matrix $M_m(k)$ collects the matchings among the nodes in $t_1$ and $t_2$
with depth equal to $k$. Fig. 4(a) displays two example XML trees with numbered
nodes. The corresponding matching matrix is shown in Fig. 4(b).

Fig. 4. Data structures for the construction of an optimal matching tree:
(a) example XML trees $t_1$ and $t_2$, (b) matching matrix, (c) matching selection.
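
The matching-detection step can be sketched as follows. The encoding of a tree as
a nested (label, children) tuple and the helper names bfs_paths and
matching_matrix are assumptions made for the example; nodes are enumerated level
by level, as the text prescribes, and two nodes match when their root-to-node
label paths coincide.

```python
from collections import deque

def bfs_paths(tree):
    """Enumerate nodes level by level; return the label path of each node."""
    out = []
    queue = deque([(tree, ())])
    while queue:
        (label, children), prefix = queue.popleft()
        path = prefix + (label,)
        out.append(path)
        for child in children:
            queue.append((child, path))
    return out

def matching_matrix(t1, t2):
    """Initial 0/1 matrix: entry (i, j) is 1 iff nodes v_i and w_j match."""
    p1, p2 = bfs_paths(t1), bfs_paths(t2)
    return [[1 if a == b else 0 for b in p2] for a in p1]

t1 = ("a", [("b", []), ("b", []), ("c", [])])
t2 = ("a", [("b", []), ("d", [])])
M = matching_matrix(t1, t2)
for row in M:
    print(row)
```

Because nodes at different depths never match, the non-zero entries fall into one
block per level. In this toy case the second column has two non-zero entries,
i.e. the b node of the second tree has a multiple matching.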

Selection of best matchings. The problem of multiple matchings can be addressed
by discarding those matchings which are less relevant according to the weighting
function $\omega_m$. The weight $\omega_m(v, w)$, associated with two matching nodes $v \in V_{t_1}$ and
$w \in V_{t_2}$, is computed by taking into account the matches between the children
nodes of both $v$ and $w$. Formally, $\omega_m(v, w) = 1 + \sum_{i,j} \omega_m(v_i, w_j)$, where nodes
$v_i, w_j$ are such that $(v, v_i) \in E_{t_1}$ and $(w, w_j) \in E_{t_2}$. Fig. 4(c) shows the weights
associated with each possible node pair.
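
The recursive weighting can be illustrated with a short sketch. It assumes that
the sum ranges over all pairs of children of the two nodes and that a
non-matching pair contributes 0; trees are again encoded as nested
(label, children) tuples, and the function name weight is hypothetical.

```python
def weight(v, w):
    """Weight of the pair (v, w), assuming their ancestors already match."""
    (label_v, children_v), (label_w, children_w) = v, w
    if label_v != label_w:
        return 0  # no strong matching, hence no contribution
    # 1 for the pair itself plus the weights of all pairs of children
    return 1 + sum(weight(a, b) for a in children_v for b in children_w)

b_deep = ("b", [("c", [])])     # b with a child c
b_flat = ("b", [])              # b without children
t1 = ("a", [b_deep, b_flat])
t2 = ("a", [("b", [("c", [])])])

print(weight(b_deep, t2[1][0]), weight(b_flat, t2[1][0]), weight(t1, t2))
```

Both b nodes of the first tree match the single b node of the second one, but the
pair that also agrees on the child c receives the larger weight and is therefore
the one retained when multiple matchings are resolved.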

Multiple matchings relative to any node of $t_1$ (resp. $t_2$) can be detected
by checking multiple entries with non-zero values within the corresponding row
(resp. column) of $M_m$. We now describe the process for detecting multiple matchings.
In the following we focus on the identification of nodes within $t_1$ that have
multiple matchings with those in $t_2$: the dual situation (i.e. identification of
nodes in $t_2$ having multiple matchings with nodes in $t_1$) is treated similarly.

Let $v_i \in V_{t_1}$ denote the node corresponding to the $i$-th row in $M_m$, and
let $J_{v_i} = \{j_1, \ldots, j_h\}$ be the set of column indexes, corresponding to the nodes
$w_{j_1}, \ldots, w_{j_h}$ of $t_2$, such that $M_m(i, j_k) > 0$ (i.e. such that $\omega_m(v_i, w_{j_k}) > 0$),
$k \in [1..h]$. Thus, $v_i$ exhibits multiple matchings if $|J_{v_i}| > 1$. For each node $v_i \in V_{t_1}$,
the best matching node corresponds to the column index
$j^*_i = \arg\max_{j_1, \ldots, j_h} \{M_m(i, j_1), \ldots, M_m(i, j_h)\}$. If the maximum in
$\{M_m(i, j_1), \ldots, M_m(i, j_h)\}$ is not unique we choose $j^*_i$ to be the minimum index.
The overall best matchings for nodes of $t_1$ can be easily tracked by using a marking
vector $V_{m_1} = \{j^*_1, \ldots, j^*_{|V_{t_1}|}\}$, whose generic $i$-th entry indicates the node of $t_2$
with which $v_i \in V_{t_1}$ has the best matching. We set $V_{m_1}[i] = -1$ if the node $v_i \in V_{t_1}$
has no matching. Fig. 4(c) shows the marking vectors $V_{m_1}$ and $V_{m_2}$ associated with
$t_1$ and $t_2$, respectively.

Fig. 5. (a) Lower-bound (optimal matching tree), (b) upper-bound (merge tree), and
(c) optimal representative tree relative to the trees of Fig. 4(a).

Optimal matching tree construction. An optimal matching tree is effectively
built by exploiting the above marking vectors: it suffices that all nodes with no
matching are discarded. Fig. 5(a) shows the optimal matching tree computed for
$t_1$ and $t_2$ of Fig. 4(a). As we can see in the figure, the optimal matching tree is
obtained from $t_1$ by removing nodes 2, 5, 8, 11.
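
The selection of best matchings through marking vectors can be sketched as
follows. The weighted matrix used here is made up for illustration (it is not the
one of Fig. 4), and the function names are hypothetical; ties are broken in
favour of the minimum column index, as stated in the text.

```python
def marking_vector(M):
    """For each row, index of the best-matching column, or -1 if none."""
    marks = []
    for row in M:
        best = max(row, default=0)
        # list.index returns the first (i.e. minimum) index on ties
        marks.append(row.index(best) if best > 0 else -1)
    return marks

def transpose(M):
    return [list(col) for col in zip(*M)]

# Hypothetical weighted matching matrix (rows: nodes of t1, columns: nodes of t2).
M = [[4, 0, 0],
     [0, 2, 0],
     [0, 1, 0],
     [0, 0, 0]]

v_m1 = marking_vector(M)             # best node of t2 for every node of t1
v_m2 = marking_vector(transpose(M))  # best node of t1 for every node of t2
print(v_m1, v_m2)
```

Entries equal to -1 identify the nodes that are discarded when the optimal
matching tree is built; rows that share the same marked column correspond to
duplicates, which buildMatch in Fig. 3 removes in a separate pass.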

3.2 Building a Merge Tree 

The optimal matching tree of two documents represents an optimal intersection
between the documents. The notion of merge tree resembles an optimized union
of the original trees. Notice that an optimal matching tree has to be detected
first, in order to avoid adding redundant nodes: a trivial merge tree could
otherwise be built simply as the union of the trees under investigation.
Function buildMerge in Fig. 3 details the construction of a merge tree, which takes
into account nodes discarded while building the optimal matching tree. To this
purpose, given two trees $t_1$ and $t_2$, we first consider nodes in $t_1$ having duplicate
nodes, and insert such duplicates into the merge tree. Next, nodes in $t_2$ which
do not match with any node in $t_1$ are added.

Fig. 5(b) shows the merge tree associated with the trees of Fig. 4(a). Nodes 8, 11
from $t_1$ and 9, 10, 11 from $t_2$ have no matching, whereas nodes 2, 5 from $t_1$ and
8 from $t_2$ exhibit multiple matchings.
3.3 Turning a Merge Tree into a Cluster Representative

An effective cluster representative can be obtained by removing nodes from a 
merge tree in such a way to minimize the distance between the refined merge tree 
and the original trees in the cluster. Procedure prune, shown in Fig. 3, iteratively 
tries to remove leaf nodes until the distance between the refined merge tree and 
the original trees cannot be further decreased. It is worth noticing that, on the 
basis of the definition of procedure prune, the representative of a cluster is always 
bounded by the optimal matching tree built from the documents in that cluster. 
The correctness of the pruning procedure is established by the following result. 

Theorem 1. Let $t_1, t_2$ be two XML trees. Moreover, let $t_M = merge(t_1, t_2)$,
$t_m = match(t_1, t_2)$ and $t^* = rep(\{t_1, t_2\})$. Then, $t_m \subseteq t^* \subseteq t_M$. □

Let us consider again the trees $t_1$ and $t_2$ of Fig. 4(a) and their associated merge
tree $merge(t_1, t_2)$ of Fig. 5(b). Suppose that $t_1$ and $t_2$ belong to the same cluster
$C$. In order to compute the representative tree for $C$, the pruning procedure
is initially applied to the set of leaves $\mathcal{L} = \{5, 8, \ldots, 12, 14, 15\}$. If we choose to
adopt the $d_J$ distance, the procedure computes an initial intra-cluster distance
$d_0 = 5/8$. This distance is reduced to $4/7$ as leaf node 14 is removed. Yet,
$d_0$ can be further decreased by removing node 12. Since at this point no further node
contributes to the minimization of $d_0$, the pruning process ends. Fig. 5(c) shows
the cluster representative resulting from pruning the merge tree in Fig. 5(b), with
the adoption of the $d_J$ distance.
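
The greedy pruning stage can be sketched under some simplifying assumptions that
are not taken from the paper: a tree is reduced to its set of root-to-node label
paths (so repeated siblings collapse), the distance is a path-based Jaccard
distance normalized by the larger path set, and leaves belonging to the matching
tree are never candidates for removal, which keeps the result between the lower
and the upper bound. All names are hypothetical.

```python
def d_paths(a, b):
    """Path-based Jaccard distance between two sets of label paths."""
    return 1 - len(a & b) / max(len(a), len(b))

def leaves(tree):
    """Paths that are not a proper prefix of any other path in the tree."""
    return {p for p in tree
            if not any(len(q) > len(p) and q[:len(p)] == p for q in tree)}

def prune(cluster, merge, match):
    """Greedily remove leaves from `merge` while the total distance decreases."""
    total = lambda r: sum(d_paths(t, r) for t in cluster)
    r = set(merge)
    while r > match:                       # stop at the lower bound
        d0 = total(r)
        candidates = [r - {leaf} for leaf in sorted(leaves(r))
                      if leaf not in match]
        if not candidates:
            break
        best = min(candidates, key=total)  # best single leaf removal
        if total(best) >= d0:
            break                          # no removal improves the distance
        r = best
    return r

t1 = {("a",), ("a", "b"), ("a", "c")}
t2 = {("a",), ("a", "b"), ("a", "c")}
t3 = {("a",), ("a", "b"), ("a", "d")}
match = t1 & t2 & t3          # lower bound: common paths
merge = t1 | t2 | t3          # upper bound: all paths
rep = prune([t1, t2, t3], merge, match)
print(sorted(rep))
```

In this toy cluster the path a/d, which occurs in only one document, is pruned,
while a/c, shared by two documents out of three, is retained even though it does
not belong to the matching tree.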



4 Evaluation 

We evaluated the effectiveness of XRep by performing experiments on both syn- 
thetic and real data. In the former case, we mainly aimed at assessing the effec- 
tiveness of our clustering scheme with respect to some prior knowledge about the 
structural similarities among the XML documents taken into account. Specifi- 
cally, we exploited a synthetic data set that comprises seven distinct classes of 
XML documents, where each such class is a structurally homogeneous group 
of documents randomly generated from a previously chosen DTD. Tests were 
performed in order to investigate the ability of XRep in catching such groups. 

In order to assemble a suitable data set, we developed an automatic
generator of synthetic XML documents that allows the control of the degree
of structural resemblance among the document classes under investigation. The
generation process works as follows. Given a seed DTD $DTD_0$, a similarity threshold
$\tau$, and a number $k$ of classes, the generator randomly yields a set S* of $k$
different DTDs, hereinafter called class DTDs, that individually retain at most a
fraction $\tau$ of the element definitions within $DTD_0$. The $k$ class DTDs are eventually
leveraged to generate as many collections of conforming XML documents, on
the basis of suitable statistical models ruling the occurrences of the document
elements [8].
The seed DTD was manually developed and exhibits a fairly complex structure.
For the sake of brevity, we only focus on its major features. $DTD_0$ contains
30 distinct element declarations that use neither attributes nor recursion.
Non-empty elements contain at most 4 children. The occurrences of such elements
are defined by exploiting all kinds of operators, namely +, *, ?, and |.
Finally, the tree-based representation of any XML document conforming to $DTD_0$
has a depth that is equal to 6.

Each test on synthetic data was performed on a distinct set of seven class
DTDs, sampled from $DTD_0$, at increasing values of the similarity threshold $\tau$: we
chose $\tau$ to be respectively equal to 0.3, 0.5, and 0.8.

Real XML documents were extracted from six different collections available
on the Internet:

— Astronomy, 217 documents extracted from an XML-based metadata repos- 
itory, that describes an archive of publications owned by the Astronomical 
Data Center at NASA/GSFC. 

— Forum, 264 documents concerning messages sent by users of a Web forum. 

— News, 64 documents concerning press news from all over the world, daily 
collected by PR Web , a company that provides free online press release 
distribution. 

— Sigmod, 51 documents concerning issues of SIGMOD Record. Such docu- 
ments were obtained from the XML version of the ACM SIGMOD Web site 
produced within the Araneus project [6]. 

— Wrapper, 53 documents representing wrapper programs for Web sites, ob- 
tained by means of the Lixto system [2] . 

— Xyleme_Sample, a collection of 1000 documents chosen from the Xyleme
repository, which is populated by a Web crawler using an efficient native
XML storage system [12].

The distribution of tags within these documents is quite heterogeneous, due to 
the complexity of the DTDs associated with the classes, and to the semantic 
differences among the documents. In particular, wrapper programs may have 
substantially different forms, as a natural consequence of the structural differ- 
ences existing among the various Web sites they have been built on: thus, the 
skewed nature of the documents in Wrapper should be taken into account. Also, 
documents sampled from Xyleme exhibit a more evident heterogeneity, since 
they have been crawled from very different Web sources. 

Clustering results were evaluated by exploiting the standard precision and
recall measures [1]. However, in the case of Xyleme_Sample, we had no knowledge
of an a-priori classification. As a consequence, we resorted to an internal
quality criterion that takes into account the compactness of the discovered
clusters. More precisely, given a cluster partition $\mathcal{P} = \{C_1, \ldots, C_k\}$, where
$C_i = \{x_1, \ldots, x_{n_i}\}$, we defined an intra-cluster distance measure $IC(\mathcal{P})$ as a
normalized sum, over every cluster $C_i \in \mathcal{P}$ and every document $x \in C_i$, of the
distances $d(x, rep(C_i))$.

Table 1 summarizes the quality values obtained testing XRep on both synthetic
and real data. All the experiments have been carried out by adopting the
Jaccard distance $d_J^{(2)}$ introduced in Section 2. Tests on synthetic data evaluated
the performance of XRep on three collections of 1400 documents (200 documents
for each class DTD). Experimental evidence highlights the overall accuracy of
XRep in distinguishing among classes of XML documents characterized by different
average sizes due to different choices for the threshold $\tau$. As we can see,
XRep exhibits an excellent behavior for $\tau \in \{0.3, 0.5\}$, while the acceptable
performance reported on row 3 (i.e. $\tau = 0.8$) is due to the intrinsic difficulty in
catching minimal differences in the structure of the involved XML documents.
Indeed, two clearly distinct class DTDs, say $DTD_i$ and $DTD_j$, may share a number
of element definitions inducing similar paths within the conforming XML
documents. If such definitions assign multiple occurrences to the elements of
the common paths, the initial class separation between $DTD_i$ and $DTD_j$ may
vanish because of a strong degree of document similarity due to a large
number of common paths in the corresponding XML trees.

Table 1. Quality results.

type   docs  avg size  classes  clusters  τ    precision  recall  F-measure  IC
synth  1400  0.13KB    7        7         0.3  0.979      0.978   0.978      0.219
synth  1400  0.81KB    7        7         0.5  0.802      0.909   0.852      0.304
synth  1400  3.19KB    7        7         0.8  0.689      0.773   0.728      0.369
real   649   5.74KB    5        5         -    1          1       1          0.208
real   500   8.56KB    -        7         -    -          -       -          0.376
real   1000  9.42KB    -        9         -    -          -       -          0.43
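
As a quick consistency check on the reported figures, the sketch below recomputes
the F-measure of the three synthetic rows of Table 1. It assumes the usual
definition of the F-measure as the harmonic mean of precision and recall, which
this section does not state explicitly.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (assumed definition)."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F-measure) for the three synthetic rows.
rows = [(0.979, 0.978, 0.978),
        (0.802, 0.909, 0.852),
        (0.689, 0.773, 0.728)]

diffs = []
for precision, recall, reported in rows:
    recomputed = f_measure(precision, recall)
    diffs.append(abs(recomputed - reported))
    print(f"reported {reported:.3f}  recomputed {recomputed:.4f}")
```

Under that assumption the recomputed values agree with the reported ones to
within about one unit in the third decimal place, which is compatible with the
table having been computed from unrounded precision and recall values.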

Tests on real data considered separately the first five collections (649 XML 
documents with an average size that is equal to 5.74KB), and the Xyleme_Sample 
collection. In the first case, XRep achieved perfect precision and recall, identifying
even latent differences among the involved real documents. As far as
Xyleme_Sample is concerned, we conducted two experiments (rows 5 and 6 in
Table 1), where in the first one we considered only one half of the dataset.
However, as we expected, in both cases intra-cluster distance provides fairly 
good values: this is mainly due to the high heterogeneity which characterizes 
documents in Xyleme_Sample. 

5 Conclusions and Further Work 

We presented a novel methodology for clustering XML documents, focusing on 
the notion of XML cluster representative which is capable of capturing the signifi- 
cant structural specifics within a collection of XML documents. By exploiting the 
tree nature of XML documents, we provided suitable strategies for tree matching, 
merging, and pruning. Tree matching allows the identification of structural simi- 
larities to build an initial substructure that is common to all the XML document 
trees in a cluster, whereas the phase of tree merging leads to an XML tree that 
even contains uncommon substructures. Moreover, we devised a suitable prun- 
ing strategy for minimizing the distance between the documents in a cluster and 
the document built as the cluster representative. The clustering framework was 
validated both on synthetic and real data, revealing high effectiveness. 
We conclude by mentioning some directions for future research. The approach
described in the paper has to be considered an initial approach to clustering
tree-structured XML data. Further notions of cluster representative can be
investigated, e.g. by relaxing the requirement that a prototype corresponds to a
single XML document. Indeed, there are many cases in which a collection of
XML documents is better summarized by a forest of subtrees, where each subtree
represents a given peculiarity shared by some documents in the collection.
A typical case arises, for instance, when the collection has an empty matching
tree, and still exhibits significant homogeneities.
Abstract. In this paper, we introduce a new approach for mining regulatory
interactions between genes in microarray time series studies. A number of prepro-
cessing steps transform the original continuous measurements into a discrete rep- 
resentation that captures salient regulatory events in the time series. The discrete 
representation is used to discover interactions between the genes. In particular, 
we introduce a new across-model sampling scheme for performing Markov Chain 
Monte Carlo sampling of probabilistic network classifiers. The results obtained 
from the microarray data are promising. Our approach can detect interactions 
caused both by co-regulation and by control-regulation. 



1 Introduction 

In bioinformatics, we are faced with an increasing amount of data that characterize the 
structure and function of different living organisms. Still more experimental data such 
as sequences (nucleotides, proteins) and gene activities (mRNA expression ratios) are 
generated either in the biology laboratory or in a clinical setting. The ever-expanding 
datasets fuel a growing demand for new datamining techniques that can help to discover 
possible relations between the biological entities under study and couple the different 
sources of data. Such datamining techniques should be able to cope with many vari- 
ables that may exhibit complex dependency relations. We present a new cross-model 
sampling Markov Chain Monte Carlo algorithm, which we test by learning Bayesian 
network classifiers to predict regulatory relations between a set of predictor genes and 
a target gene. 

Microarrays were introduced in the nineties as a means for studying in parallel the 
expression of all genes pertaining to a particular organism. One of the ultimate goals is 
to discover which genes are involved in the regulation of others, the so-called regulatory 
pathways. Microarrays measure the relative abundance of mRNA, corresponding to 
each known gene transcribed at a certain time t in a particular organism under study. 
So the prospect of microarrays is that of an aid that can help to identify functional roles 
of genes and eventually enrich the knowledge of the complex relations between the 
genotype and the phenotype of the organism under study. 

Microarray time series experiments are conducted in order to study significant dy- 
namic expression patterns. One goal of a time series experiment is to investigate which 
genes regulate others. It is to be expected that some genes that are controlled by the 

J.-F. Boulicaut et al. (Eds.): PKDD 2004, LNAI 3202, pp. 149-160, 2004. 

© Springer- Verlag Berlin Heidelberg 2004 




150 Michael Egmont-Petersen, Wim de Jonge, and Arno Siebes 



same transcription factor show a similar but lagged expression pattern over time, when 
the expression of the particular transcription factor varies. We make a distinction
between co-regulation and controlled regulation. Two genes are said to be positively
co-regulated when the change in relative abundance of the genes has the same first-order
derivatives with respect to time. Two genes are said to be inversely co-regulated when
the change in relative abundance of the genes has the opposite first-order derivatives
with respect to time. Two perfectly co-regulated genes can have expression patterns
with different amplitudes. One or more genes (the regulators) are said to control the 
expression of a particular gene (the target) when the expressions of the regulator genes 
directly influence the expression of the target gene. 
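
The two notions of co-regulation can be illustrated with a small sketch. It is an
interpretation rather than the authors' procedure: derivatives are approximated
by first differences of equally sampled signals, and, since perfectly
co-regulated genes may differ in amplitude, only the sign of the derivative is
compared. The function names are hypothetical.

```python
def first_differences(signal):
    """Finite-difference approximation of the first-order derivative."""
    return [b - a for a, b in zip(signal, signal[1:])]

def sign(x):
    return (x > 0) - (x < 0)

def co_regulation(g1, g2):
    """Return 'positive', 'inverse' or None for two equally sampled signals."""
    s1 = [sign(d) for d in first_differences(g1)]
    s2 = [sign(d) for d in first_differences(g2)]
    if s1 == s2:
        return "positive"
    if s1 == [-s for s in s2]:
        return "inverse"
    return None

gene_a = [0.0, 1.0, 3.0, 2.0]
gene_b = [0.0, 5.0, 6.0, 1.0]   # same direction of change, different amplitude
gene_c = [4.0, 2.0, 1.0, 3.0]   # opposite direction of change
gene_d = [0.0, 1.0, 0.5, 2.0]   # unrelated pattern

print(co_regulation(gene_a, gene_b),
      co_regulation(gene_a, gene_c),
      co_regulation(gene_a, gene_d))
```

Such a point-wise comparison ignores lag times between genes, which is exactly
the limitation that the lagged analysis discussed in the text is meant to address.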

Under conditions where particular genes are co-regulated or one or more regulator 
genes control the expression of a target gene, one would expect co-variation between 
the expressions of these genes over time. Our goal is to develop a datamining approach 
that can discover dynamic patterns of co-regulation and control regulation between sets 
of genes. Clustering techniques and correlation measures have been used extensively 
to identify groups of genes that are likely to be functionally related, see, e.g., Datta 
& Datta for an overview [1]. However, the standard clustering techniques do not take
post-transcriptional and post-translational lag times into account. More importantly, in 
mining the vast amount of time series array data for putative control regulation relations, 
lags between expression levels of genes may contain indicative clues as to which genes 
code for proteins that act as regulators for others. 

In this article, we present a novel datamining method for finding possible regulatory 
relations between small sets of genes, based on time-course microarray data, see, e.g., 
[2]. In the sequel, we regard the normalized (relative) expression levels of each gene
as a time signal. We introduce preprocessing steps that transform such a time signal
into "salient features", points in time that may disclose possible lagged interactions
between genes. From this discrete representation, we train dynamic Bayesian networks 
to predict regulatory events of specific target genes using a novel MCMC-approach. 
Our new method is evaluated on microarray data obtained from the experiments by 
Spellman et al. [2]. The results are promising. Most of the regulatory relations found 
could be corroborated by literature. 

2 Microarray Data 

Our goal is to discover and interpret statistical relations between the relative expression 
of genes. For that purpose, we need to choose a suitable representation scheme for time 
series microarray data. Generally, each spot indicates the average relative (log) expression 
of mRNA corresponding to a particular gene R_i. The expression ratio of gene R_i 
can be seen as a continuous stochastic variable, characterized by the probability density 
function p(R_i). Each variable R_i can either be a predictor or a target, relative to the 
other variables entering the model. We use t to denote the time step at which variable R_i 
is being measured and discretize the array data R_i(t), t ∈ {t_0, ..., t_T}, into the following 
three categories: change, local minimum and local maximum. This differs from the 
approach by others [3,4], who make a distinction between up-regulated, medium regulated 
and down-regulated gene expression. Our representation is different in the sense 
that it combines successive expression ratios R_i(t_{v-1}), R_i(t_v) and R_i(t_{v+1}) into features 
that capture the local dynamics (local extrema) of the expression ratios. With our 
representation, relations are discovered between the most likely time points at which a 
gene (or possibly its associated protein) is active (local maximum) and inactive (local 
minimum). Our approach makes it possible to establish a regulatory relation between a 
transcription factor with small absolute changes in expression ratio, and a target gene, 
because the amplitude is disregarded. 

[Figure 1 panels: Array time signal, Linear interpolation, Max-min filter] 

Fig. 1. In total six preprocessing steps are performed before our new datamining algorithm is 
applied to the dataset: 1) the log ratios of each gene are computed, 2) linear interpolation results 
in uniformly sampled log expression ratios, 3) the (optional) max-min filter removes transient 
extrema, 4) convolution with the first-order derivative of the Gaussian function results in 
derivatives of the expression ratios, 5) the local extrema are defined as time points at which the sign 
of the first-order derivative changes, 6) the selected target gene is coded as a binary variable by 
duplication. These steps are followed by 7) MCMC-learning of local genetic networks. 

The preprocessing steps consist of 1) computation of the log-ratio per gene, 2) linear 
interpolation, 3) max-min filtering (optional) and 4) detection of local minima and local 
maxima using the derivative operator from the linear scale space. In steps 5) and 6), 
the local extrema are identified and the number of observations is doubled. 

2.1 Interpolation 

Computation of the derivatives over each gene entails the application of (linear) filters. 
Filtering requires that the signal be uniformly sampled over time. We use a linear nearest 
neighbor scheme to interpolate non-uniformly sampled time series because this scheme 
can never introduce new local minima or maxima. Interpolation results in a uniformly 
sampled time series of expression ratios R_i(t), t ∈ {1, ..., T}, for gene i. 
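
The interpolation step can be illustrated with a minimal Python sketch. It assumes strictly increasing sampling times and a user-chosen grid spacing; the function name `interpolate_uniform` and its arguments are ours and are not taken from the original implementation. Because every interpolated value lies between its two neighbouring measurements, no value outside the measured range can appear.

```python
def interpolate_uniform(times, values, step):
    """Linearly interpolate (times, values) onto a grid with spacing `step`.

    `times` must be strictly increasing. Each grid point is a weighted
    average of the two measurements that enclose it.
    """
    if len(times) != len(values) or len(times) < 2:
        raise ValueError("need at least two (time, value) pairs")
    grid, out = [], []
    j = 0  # index of the left neighbour of the current grid point
    n_steps = int((times[-1] - times[0]) / step + 1e-9)
    for k in range(n_steps + 1):
        t = times[0] + k * step
        while j < len(times) - 2 and times[j + 1] < t:
            j += 1
        t0, t1 = times[j], times[j + 1]
        w = (t - t0) / (t1 - t0)
        out.append((1 - w) * values[j] + w * values[j + 1])
        grid.append(t)
    return grid, out


# Example: measurements at 0, 10 and 30 minutes, resampled every 10 minutes.
grid, ratios = interpolate_uniform([0, 10, 30], [0.0, 1.0, -1.0], 10)
```

In this example the missing sample at 20 minutes is filled with the midpoint of its two neighbours.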

2.2 Max-Min Filter 

To cope with transient changes in the first order derivative as a result of noise, we 
incorporate an extra (optional) preprocessing step consisting of the morphological 
max-min filter [5]. An advantage of the max-min filter is that non-transient extrema in the 
original signal are left unaffected. The max-min filter is defined as 



\[
K(t) = \frac{1}{2} \left[ \max_{t_1 \in b(t)} \, \min_{t_2 \in b(t_1)} R(t_2)
     \; + \; \min_{t_1 \in b(t)} \, \max_{t_2 \in b(t_1)} R(t_2) \right] \tag{1}
\]



When the width of the window b(t) exceeds zero, small inflections become saddle 
points, otherwise K(t) = R(t). 
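
A minimal Python sketch of such a filter is given below, under two assumptions of ours: the window b(t) is a symmetric interval of `half_width` samples on each side, and it is truncated at the ends of the series. The filter output is taken as the average of the max-of-min and min-of-max terms, so that a window of width zero returns the input unchanged. The function name is hypothetical.

```python
def max_min_filter(signal, half_width):
    """Average of the max-of-min and min-of-max of `signal` over window b(t)."""
    n = len(signal)

    def window(t):
        # b(t): indices within half_width of t, truncated at the borders
        return range(max(0, t - half_width), min(n, t + half_width + 1))

    local_min = [min(signal[u] for u in window(t)) for t in range(n)]
    local_max = [max(signal[u] for u in window(t)) for t in range(n)]
    max_of_min = [max(local_min[u] for u in window(t)) for t in range(n)]
    min_of_max = [min(local_max[u] for u in window(t)) for t in range(n)]
    return [0.5 * (lo + hi) for lo, hi in zip(max_of_min, min_of_max)]


# A one-sample spike is attenuated, whereas a steady ramp keeps its interior values.
spike = max_min_filter([0, 0, 5, 0, 0], 1)
ramp = max_min_filter([0, 1, 2, 3, 4], 1)
```

With these conventions the border samples of the ramp are slightly altered by the truncated windows; a different border treatment would be an equally defensible choice.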



2.3 Regularized Differentiation 

Transformation of the continuous expression ratios K_i(t) into the desired discrete 
representation: change, local minimum and local maximum, requires the computation of 
derivatives, dK_i(t)/dt. We use operators from the linear scale space [6, 7] to transform 
differentiation into a well-posed problem [8] by means of regularization. Regularized 
derivatives of a discrete time series are obtained by convolution with the first-order 
derivative of the Gaussian function 

\[
g'(\tau \mid 0, \sigma) = \frac{-\tau}{\sigma^{3}\sqrt{2\pi}} \exp\!\left(-\frac{\tau^{2}}{2\sigma^{2}}\right) \tag{2}
\]

Convolution with g' results in 

\[
H(t) = \int_{-\infty}^{\infty} g'(\tau \mid 0, \sigma) \cdot K(t - \tau) \, d\tau \tag{3}
\]

When the sign of H(t) changes between two consecutive time steps, H(t − δ) < 0 but 
H(t + δ) > 0, this indicates a local minimum, whereas H(t − δ) > 0 but H(t + δ) < 0 
indicates a local maximum. When there is no change in sign, the time step t gets 
the label change. 
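
The following Python sketch illustrates regularized differentiation and the subsequent labelling. Several details are assumptions on our part: the kernel is the derivative of a unit-area Gaussian truncated at four standard deviations, border samples are replicated, and a sign change between two consecutive samples is attributed to the later of the two. All function names are hypothetical.

```python
import math


def gaussian_derivative_kernel(sigma):
    """Samples of the first derivative of a unit-area Gaussian."""
    radius = int(math.ceil(4 * sigma))
    norm = sigma ** 3 * math.sqrt(2 * math.pi)
    return [-tau / norm * math.exp(-tau * tau / (2 * sigma * sigma))
            for tau in range(-radius, radius + 1)]


def smoothed_derivative(signal, sigma):
    """Discrete convolution H(t) = sum over tau of g'(tau) * K(t - tau)."""
    kernel = gaussian_derivative_kernel(sigma)
    radius = len(kernel) // 2
    n = len(signal)
    out = []
    for t in range(n):
        acc = 0.0
        for k, g in enumerate(kernel):
            tau = k - radius
            idx = min(max(t - tau, 0), n - 1)  # replicate border samples
            acc += g * signal[idx]
        out.append(acc)
    return out


def label_extrema(h):
    """Label each time step as 'min', 'max' or 'change' from the sign of h."""
    labels = ["change"] * len(h)
    for t in range(1, len(h)):
        if h[t - 1] < 0 <= h[t]:
            labels[t] = "min"   # derivative goes from negative to non-negative
        elif h[t - 1] > 0 >= h[t]:
            labels[t] = "max"   # derivative goes from positive to non-positive
    return labels


# A V-shaped signal has a single local minimum and no local maximum.
labels = label_extrema(smoothed_derivative([4, 3, 2, 1, 0, 1, 2, 3, 4], 1.0))
```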



2.4 Data Representation and Modeling 

Our goal is to identify possible co-regulatory and control-regulatory relations between 
sets of genes. With C(R_i, R_j) we indicate co-regulation between the genes R_i and R_j, 
whereas T(R_i → R_j) indicates that gene R_i controls the regulation of gene R_j. An 
important difference between co-regulation and controlled regulation is that co-regulation 
is a commutative relation, whereas controlled regulation is assumed not to be 
commutative. Consequently, the inclusion of lags in the time series should, in theory, make it 
possible to discern putative control regulations from co-regulations. The continuous-valued 
variable H(t) is discretized by the function f. This results in a discrete time series 
per gene, x_{i,t} = f(H_i(t − δ), H_i(t + δ)), with X_{i,t} = x_{i,t}, x_{i,t} ∈ {min, change, max}. 

We propose to model possible gene interactions using dynamic Bayesian network 
classifiers. In the remaining part of the paper, we use X to indicate the set of predictor 
genes and C the target gene. Figure 2 indicates which types of relations may be found 
by our approach. We use a lagged time series model in the following way. At each time 
Fig. 2. Significant regulatory events of the expression of a target gene are being predicted by 
regulatory events pertaining to other genes earlier and at the same time as the target. Thereby, 
both control regulation and co-regulation with predictor genes can be modeled. 



step t, the outcome of one target gene C_t, c_t, should be predicted by the outcomes of the 
predictor genes x_{i,τ}, τ ∈ {t − λ, ..., t} with λ, λ > 0, indicating the maximal lag that 
can be accounted for. This representation results in the following matrix of n = r × λ 
(potential) predictor variables 

\[
\begin{pmatrix}
x_{1,t-\lambda} & \cdots & x_{1,t} \\
x_{2,t-\lambda} & \cdots & x_{2,t} \\
\vdots          &        & \vdots  \\
x_{r,t-\lambda} & \cdots & x_{r,t}
\end{pmatrix} \tag{4}
\]

from which the outcome of C_t is being predicted. The data (X_t, C_t), t ∈ {1, ..., T}, 
constitutes the basic training set. We use a datamining algorithm that performs concomitant 
feature and model selection in order to estimate the most likely lagged classifier 
model. Connections to features (possible regulator genes) that contribute to predicting 
the outcome of C_t are likely to be included in the model whereas genes that do not 
improve the predictive performance remain disconnected. 
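
As a sketch of how such a lagged training set could be assembled in Python, assume that each gene has already been discretized into one label per time step. The dictionary layout and the name `lagged_training_set` are illustrative assumptions. Time steps without a complete lag history are dropped.

```python
def lagged_training_set(predictors, target, max_lag):
    """Pair the target label at time t with predictor labels at t - max_lag, ..., t.

    predictors: dict mapping gene name -> list of labels, one per time step
    target:     list of labels for the target gene
    Returns a list of (features, target_label) pairs, where `features`
    maps (gene, lag) to the label of that gene `lag` steps earlier.
    """
    genes = sorted(predictors)
    rows = []
    for t in range(max_lag, len(target)):
        features = {(gene, lag): predictors[gene][t - lag]
                    for gene in genes
                    for lag in range(max_lag, -1, -1)}
        rows.append((features, target[t]))
    return rows


# With 29 time points and lags 2, 1 and 0, only 27 complete rows remain.
series = {"CLB2": ["change"] * 29, "SWI5": ["change"] * 29}
rows = lagged_training_set(series, ["change"] * 29, max_lag=2)
```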

For most of the genes, a local extremum occurs much less frequently than a change. 
Consequently, it is likely that many correlations appear between genes of which the 
expression changes. To ensure that only local minima and local maxima of the target 
gene are being predicted, we choose to reduce the number of possible outcomes of the 
target variable to just two: local minimum and local maximum. To retain the three basic 
outcomes, we double the number of observations resulting in a final training set D 

- (x_t, c_t = max) becomes d_s = (x_t, c_t = max) and d_{s+1} = (x_t, c_t = max) 

- (x_t, c_t = min) becomes d_s = (x_t, c_t = min) and d_{s+1} = (x_t, c_t = min) 

- (x_t, c_t = change) is doubled into an ambiguous prediction of c_t, d_s = (x_t, c_t = 
min) and d_{s+1} = (x_t, c_t = max) 

with s = 1, ..., 2T − 1. Note that we do not add any information to D that is not 
included in the original data. Only the number of observations doubles, a fact that can 
easily be accounted for if one wants to estimate, e.g., the variance of the model outcome. 
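
The doubling rule can be written down in a few lines of Python. This is an illustrative sketch; the function name and the representation of an observation as a (features, label) pair are our assumptions.

```python
def double_observations(rows):
    """Turn three-valued targets into two-valued ones by duplicating each case.

    'min' and 'max' cases are repeated unchanged, while a 'change' case is
    split into one 'min' and one 'max' case, which makes it ambiguous.
    """
    doubled = []
    for features, label in rows:
        if label == "change":
            doubled.append((features, "min"))
            doubled.append((features, "max"))
        else:
            doubled.append((features, label))
            doubled.append((features, label))
    return doubled


# Two observations become four.
example = double_observations([({"CLB2": "max"}, "max"), ({"CLB2": "min"}, "change")])
```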
3 Dynamic Bayesian Network Classifiers 



Before describing in detail how to build dynamic Bayesian classifiers, we briefly con- 
sider previous work. Friedman [9] pioneered with his Bayesian network approach to 
modelling gene interactions. A separate variable indicates the cell cycle phase, i.e., time. 
Husmeier [3] used a dynamic Bayesian network to model the lagged relations between 
genes using the likelihood of the graph as a scoring metric. Husmeier acknowledges 
the problem imposed by the limited size of available microarray time series. As earlier 
stated, we choose to predict the change in expression of individual genes by a classifier. 

A probabilistic network classifier M = (G, θ) consists of a structural model specification, 
the directed graph G, and the parameters, θ, with the (un)conditional probability 
θ_{i,j,π(i)} = P(D_i = d_j | π(D_i) = d_{π(D_i)}). The notation π(D_i) = d_{π(D_i)} indicates the 
values of the parents of node D_i in the graph G (the parents constitute the nodes with 
arcs pointing directly to node D_i). Computation of the posterior probability distribution 
P(C | X) is specified by the directed graph. It follows from the chain rule that the joint 
probability P(d) = P(c, x) is computed from 

\[
P(d) = \prod_{i=1}^{k+1} P\big(D_i = d_j \mid \pi(D_i) = d_{\pi(D_i)}\big) \tag{5}
\]

A little manipulation of Bayes' formula yields the posterior probability associated with 
class label c_j 

\[
P(c_j \mid x) = \frac{P(c_j, x)}{\sum_{m} P(c_m, x)} \tag{6}
\]
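
To make the normalization concrete, the sketch below computes the class posterior for the simplest possible structure, in which the class node is the only parent of every predictor. This structure, the probability tables and the function names are assumptions made for illustration; the networks learned in this paper may contain additional arcs.

```python
def joint(c, x, prior, cpts):
    """Chain-rule product P(c, x) when the class is the only parent of each x_i."""
    p = prior[c]
    for i, x_i in enumerate(x):
        p *= cpts[i][c][x_i]
    return p


def posterior(x, prior, cpts):
    """Normalize the joint probabilities over all class labels."""
    scores = {c: joint(c, x, prior, cpts) for c in prior}
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}


# One predictor gene that tends to show the same event as the target.
prior = {"min": 0.5, "max": 0.5}
cpts = [{"min": {"min": 0.8, "change": 0.1, "max": 0.1},
         "max": {"min": 0.1, "change": 0.1, "max": 0.8}}]
post = posterior(("min",), prior, cpts)
```

Here the posterior probability of a local minimum, given that the predictor shows a local minimum, is 0.4 / (0.4 + 0.05) = 8/9.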



3.1 Learning Probabilistic Network Classifiers 

Probabilistic network classifiers [10] have to be learned from a dataset D. In the past, 
complete graphical models have successfully been learned using the approach intro- 
duced by Madigan & York [11]. However, their version of the MCMC-algorithm is 
not appropriate for learning probabilistic network classifiers, because it samples com- 
plete graphs drawn from the conditional distribution P(G | D). Instead, we introduce a 
novel Markov Chain Monte Carlo technique based on the principles of Reversible Jump 
MCMC [12] to sample the posterior distribution of probabilistic network classifiers. We 
make a simplification that leads to a less complex across-model sampling scheme than 
RJMCMC. Consequently, we can omit the Jacobian determinant term. 

Let the variables in the learning database D be separated into a set of predictor 
variables X and a classification variable C, D = (C, X). Our goal is to sample models 
from the following target distribution P(L(C) | D), with L(C) a score function (also 
called loss function [13]) 

\[
P(L(C) \mid X, G, \theta^{*}, D) = \prod_{d \in D} l\big(P(c \mid x, G, \theta^{*}, d);\, v, \gamma\big) \tag{7}
\]

with l the modified step function 

\[
l(y; v, \gamma) =
\begin{cases}
\tfrac{1}{2} - \gamma & : \; y < \tfrac{1}{2} - v \\
\tfrac{1}{2}          & : \; |y - \tfrac{1}{2}| \le v \\
\tfrac{1}{2} + \gamma & : \; y > \tfrac{1}{2} + v
\end{cases} \tag{8}
\]

for which it holds that l(C = c) ∈ (0, 1), Σ_c l(C = c) = 1. The modified step 
function has two parameters, the span of indeterminacy v and the bounding probability 
γ. The parameter v determines the range of posterior probabilities regarded as ties, 
resulting in an intermediate score. The bounding probability 0.5 > γ > 0 determines 
the gain or loss obtained by classifying a case correctly or wrongly, respectively. The 
score L(C) can be considered a genuine probability, similar to the likelihood P(D | G) 
applied by Madigan & York. The distribution P(L(C) | D) cannot be sampled directly, 
hence we perform the following factorization yielding a hierarchical Bayesian model: 

\[
P(L(C) \mid D) = \sum_{G} P(L(C) \mid X, G, D) \, P(G \mid X) \, P(X \mid k, D) \, P(k \mid D) \tag{9}
\]

with G the directed graph, X the observations corresponding to the subset of selected 
predictor variables, and k the number of selected predictor variables. Computation of 
P(L(C) | X, G, D) requires a closed form solution to 

\[
P(L(C) \mid X, G, D) = \int_{\Theta} P(L(C) \mid X, G, \theta, D) \, P(\theta \mid X, G, D) \, d\theta \tag{10}
\]

in which P(L(C) | X, G, θ, D) is the probability of the score L(C), given the parameter 
vector θ, the data associated with the predictor variables X, the acyclic graph G 
and the database D. As no closed form is presently available, we suggest using instead 
P(L(C) | X, G, θ*, D) with θ* the maximum-likelihood estimate of the parameter 
vector¹. Note that the model G does not change as a function of θ. Since G does not 
depend on θ, conditioning on θ*, the most likely parameter vector, will not strongly 
bias the estimate of P(L(C) | X, G, D). However, this approximation necessitates the 
use of a regularization prior. The following derivation is based on work presented 
elsewhere [14]. The variance of ln(L(C)) equals the sum of the variances of ln(l(x_i; v, γ)), 
pertaining to the individual cases i 

\[
\sigma^{2}_{\ln P(L(C))} = \Big( \ln\big(\tfrac{1}{2} + \gamma\big) - \ln\big(\tfrac{1}{2} - \gamma\big) \Big)^{2} \sum_{i} p_i \, (1 - p_i) \tag{11}
\]

The probability p_i is in fact the probability per case that resampling the training set 
leads to the same winner resulting in l(x_i; v, γ) = 0.5 + γ. Conversely, 1 − p_i is an 
error rate for a correctly classified case i. Consequently, we subtract a regularization 
term based on this variance from 

ln{P(L(C) | X, G, θ*, D)}. 

¹ This motivates our choice of score function in the first place. 
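
The score can be sketched in Python as follows. The step function returns 0.5 − γ, 0.5 or 0.5 + γ depending on the posterior probability assigned to the true class. The variance helper assumes, as a simplification of ours, that each case's log score takes only the two extreme values, with probability p_i for the favourable one. All names are hypothetical.

```python
import math


def step_score(y, v, gamma):
    """Modified step function: loss, tie or gain for posterior probability y."""
    if y < 0.5 - v:
        return 0.5 - gamma
    if y > 0.5 + v:
        return 0.5 + gamma
    return 0.5


def log_score(true_class_probs, v, gamma):
    """ln of the product of step scores over all cases in the training set."""
    return sum(math.log(step_score(y, v, gamma)) for y in true_class_probs)


def log_score_variance(p_same_winner, gamma):
    """Sum of two-point variances of the per-case log scores (our assumption)."""
    spread = math.log(0.5 + gamma) - math.log(0.5 - gamma)
    return spread ** 2 * sum(p * (1.0 - p) for p in p_same_winner)


# Three cases: one clear hit, one tie and one clear miss.
total = log_score([0.9, 0.52, 0.1], v=0.05, gamma=0.4)
```

For a two-class problem the scores of the two classes add up to one, because a posterior of y for one class implies 1 − y for the other.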



The Markov Chain Monte Carlo algorithm should preferably not be biased towards 
a certain number of features or model complexity. Hence, we propose to use a uniform 
prior P(k | D) on the size of the feature set k. For each feature set size k, each feature 
subset should be equally likely, so P(X | k, D) is also uniform. Finally, for a particular 
feature set, each possible model utilizing this feature subset should have the same prior, 
so P(G | X) is uniform. We define the one-step look ahead neighborhood of the 
graph G consisting of the directed acyclic graphs of classifiers that can be constructed 
by adding one arc to G or deleting one arc from G. The neighborhood NB_C(G) is 
subdivided into four disjoint subsets 

\[
NB_C(G) = \{ NB_C(G + 1_F), \; NB_C(G - 1_F), \; NB_C(G + 1_M), \; NB_C(G - 1_M) \} \tag{12}
\]



The subset NB_C(G + 1_F) contains the graphical models in NB_C(G) where the addition 
of an arc implies that G' contains one feature variable more than G. The subset 
NB_C(G − 1_F) contains the models in NB_C(G) where the deletion of an arc implies 
that G' contains one feature variable less than G. The subset NB_C(G + 1_M) contains 
the models in NB_C(G) where the addition of an arc increases the complexity of G', but 
where G and G' include the same feature variables. NB_C(G − 1_M) contains the models 
in NB_C(G) where the deletion of an arc decreases the complexity of G', but where G 
and G' include the same feature variables. Define the appropriate proposal distribution 
q_C 

\[
q_C(G \to G') =
\begin{cases}
q_1\big(|NB_C(G + 1_F)|^{-1}\big) & : \; u < \tfrac{1}{4} \\
q_2\big(|NB_C(G - 1_F)|^{-1}\big) & : \; \tfrac{1}{4} \le u < \tfrac{1}{2} \\
q_3\big(|NB_C(G + 1_M)|^{-1}\big) & : \; \tfrac{1}{2} \le u < \tfrac{3}{4} \\
q_4\big(|NB_C(G - 1_M)|^{-1}\big) & : \; \tfrac{3}{4} \le u
\end{cases} \tag{13}
\]



with u ~ U(0, 1). The proposals q_1, q_2, q_3 and q_4 result in a classifier pertaining to each 
of the four disjoint sub-neighborhoods, NB_C(G + 1_F), NB_C(G − 1_F), NB_C(G + 1_M) 
or NB_C(G − 1_M), respectively. The proposal distribution q_C implements the uniform 
priors, P(G | X), P(X | k, D) and P(k | D). So in each proposal, the MCMC-algorithm 
chooses with the same probability to add a feature, delete a feature, increase 
the model complexity or simplify the model (the two latter moves keep the same feature 
subset). The resulting Metropolis-Hastings ratio becomes 

\[
\frac{P(L(C) \mid X_q, G_q, \theta^{*}_q, D) \; P_q\big((X_q, k_q) \to (X, k)\big) \; V_q}
     {P(L(C) \mid X, G, \theta^{*}, D) \; P_q\big((X, k) \to (X_q, k_q)\big) \; V}
\]

with q indicating the new proposal, and the regularization terms 

\[
\ln(V_q) = -\alpha \, \sigma_{\ln P(L(C) \mid X_q, \ldots)} \quad \text{and} \quad \ln(V) = -\alpha \, \sigma_{\ln P(L(C) \mid X, \ldots)} .
\]

The proposal probabilities, P_q((X_q, k_q) → (X, k)) and P_q((X, k) → (X_q, k_q)), correct 
for parts of the model space where one or more of the sub-neighborhoods are empty. 
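
A simplified Python sketch of one way to organize such a sampler is shown below. It is not the implementation used in this paper. It assumes that a move type is chosen with probability 1/4, that a candidate is then picked uniformly from the corresponding sub-neighborhood, and that the chain stays in place when that sub-neighborhood is empty. The regularization term is assumed to be folded into the user-supplied `log_score`. The toy model space (feature subsets without arc-level moves) and all names are hypothetical.

```python
import math
import random

MOVES = ("add_feature", "del_feature", "add_arc", "del_arc")
REVERSE = {"add_feature": "del_feature", "del_feature": "add_feature",
           "add_arc": "del_arc", "del_arc": "add_arc"}


def log_acceptance(log_new, log_old, n_forward, n_reverse):
    """Log Metropolis-Hastings ratio for uniform picks within a sub-neighborhood."""
    return (log_new - log_old) + math.log(n_forward) - math.log(n_reverse)


def mcmc(start, neighbours, log_score, iterations, rng):
    """Return the list of models visited by the chain."""
    model, visited = start, []
    for _ in range(iterations):
        move = MOVES[int(rng.random() * 4)]        # u ~ U(0, 1) split into quarters
        candidates = neighbours(model)[move]
        if candidates:                             # empty sub-neighborhood: stay put
            proposal = rng.choice(candidates)
            n_reverse = len(neighbours(proposal)[REVERSE[move]])
            log_a = log_acceptance(log_score(proposal), log_score(model),
                                   len(candidates), n_reverse)
            if rng.random() < math.exp(min(0.0, log_a)):
                model = proposal
        visited.append(model)
    return visited


# Toy model space: subsets of three candidate regulators, no arc-level moves.
FEATURES = (0, 1, 2)


def toy_neighbours(model):
    return {"add_feature": [model | {f} for f in FEATURES if f not in model],
            "del_feature": [model - {f} for f in model],
            "add_arc": [], "del_arc": []}


def toy_log_score(model):
    # Features 0 and 1 help prediction, feature 2 hurts it.
    return 50.0 * len(model & {0, 1}) - 50.0 * (2 in model)


chain = mcmc(frozenset(), toy_neighbours, toy_log_score, 2000, random.Random(0))
```

In this toy setting the chain settles on the subset {0, 1}, which mirrors how the most frequently visited feature subsets are read off the Markov chain.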
4 Experiments 

To validate the applicability of our method on a true biological system, we used the 
yeast cell-cycle expression dataset from Spellman et al. [2]. The yeast cell cycle is a 
highly regulated process, with a central role for a class of genes named cyclins. Cy- 
clins are transiently expressed in different phases of the cell-cycle, and team up with a 
cyclin-dependent kinase (CDK). Together, the cyclins and the kinases regulate the ex- 
pression and/or activity of transcription factors, which in turn regulate the expression of 
genes that are directly involved in the diverse processes that prepare a yeast cell for divi- 
sion. We used an experiment where cells were initially synchronized, and subsequently 
followed in time as they progressed through the cell cycle. 

The cyclins CLB2 and CLN3 are functional partners of the essential CDK CDC28. 
Clb2p/Cdc28p posttranscriptionally regulates transcription factors Mcm1p/Fkh2p 
through Ndd1p. We followed the expression of CDC28, CLB2, MCM1, and several 
target genes of MCM1/FKH2 to determine whether this genetic network could be 
identified using our method. Cln3p/Cdc28p are known to regulate the activity of the Swi5p 
transcription factor; their expression and the expression of target genes of Swi5p were 
analyzed. Ste12p, another transcription factor acting in concert with Mcm1p, was also 
analyzed together with some of its target genes. 

To investigate the influence of our signal processing steps, parameter settings of the 
max-min filter and the scale-space transformation were varied, and co- and controlled 
regulatory events were compared to actual regulatory interactions described in the liter- 
ature [15]. Finally, our co-regulatory relations were compared with the results obtained 
from hierarchical clustering (Euclidean distance measure). A summary of our results 
per target gene is presented in Table 1. 

We varied the settings of the max-min filter and the scale-space transformation. In 
total 29 time points were sampled from the Spellman data. To obtain a data set with 
a uniformly sampled time, nearest neighbor linear interpolation was applied to a few 
time steps. This interpolation scheme was chosen because it can never introduce new 
extrema in the time series. Subsequently, dynamic predictor variables were extracted 
with lags of 2, 1 and 0 time steps (with each time step corresponding to 
10 minutes). As a complete time series with all three lags is required, only 27 time 
points were available. After preprocessing (Fig. 1), in total 54 (doubled) data points 
were available. The following genes were considered as targets: CLB1, BUD4, SWI4, 
CDC6, AGA1, ASH1, CDC45, CDC47, CTS1, FUS1 and MFA2. As predictive feature 
variables, the following variables were included: MCM1, STE12, CDC28, CLB2, 
CLN3 and SWI5. Corresponding to each target gene, the MCMC-algorithm was run for 
10,000 iterations. The most likely and second-most likely feature subsets occurring in 
the Markov chain were identified, see Table 1. We could find some co-regulations and 
controlled regulations with every setting applied. The max-min filter was important for 
the end result; when not applied, many spurious correlations were found, likely due 
to the relatively high noise in the signal corresponding to the lower expressed regulatory 
genes. The higher σ² was set, the more significant our results were. With σ² set 
at 4, only one false positive interaction T(CLN3 → MFA2) was detected, yet some 
co-regulatory events were missed that were apparent when σ² was set to 2. Since we 
were primarily interested in controlled regulation, we used the max-min filter set at 
Table 1. Target genes are listed in the leftmost column; the predictor genes selected at the 
different lag times are indicated by their names. Results consistent with co-regulation (also 
close in hierarchical clustering) or controlled regulation are indicated with a Y(es) in the 
'Valid' and 'Close' columns, respectively. Spurious correlation is indicated with a N(o). The 
question marks indicate possible controlled regulations, where regulatory genes were 
co-regulated with their targets. The parentheses (...) indicate observations pertaining to the 
second-most likely model found by MCMC. 

Target gene | Lag(0)      | Valid | Close | Lag(-1, -2) | Valid 
CLB1        | CLB2        | Y     | Y     |             | 
BUD4        | CLB2        | Y     | Y     |             | 
SWI4        |             |       |       | CLB2        | Y 
CDC6        |             |       |       | CLB2        | Y 
AGA1        |             |       |       | CLB2        | Y 
ASH1        |             |       |       | CLB2 (SWI5) | Y (Y) 
CDC45       |             |       |       | CLB2 (MCM1) | Y (Y) 
CDC47       | CDC28       | N     | N     |             | 
CTS1        |             |       |       | SWI5        | Y 
FUS1        | SWI5 (MCM1) | N (?) | N     |             | 
MFA2        |             |       |       | CLN3 (CLB2) | N (Y) 



w = 3, and set σ² to 2 in the scale-space transformation. The regularization 
parameter α was set to 10. 

5 Discussion 

Our method relies solely on the timing of expression ratios of mRNAs, corresponding 
to the genes under investigation. It is possible to imagine that regulators, when altered 
in level, can change the level of their target genes at a given time in the near future. We 
expect the time course of regulatory events to be limited by diffusion of the molecules 
within the cell, and the rate of transcription of a target gene, and therefore we expect 
controlled regulations to occur within the time frame of minutes. Since one time point 
represents 7 minutes in the dataset under investigation, only the lag (0) co-regulation 
and the lag (−1) and lag (−2) controlled regulatory events were taken into account. 

Microarray data are inherently noisy, and only describe the expression of mRNA 
levels, ruling out the possibility to directly detect interactions due to cellular processes 
occurring after translation (e.g., mRNA decay, protein modifications, protein degrada- 
tion). Despite these obstacles, our combination of preprocessing, coupled to selection 
of predictive features from a group of potential regulatory genes, allowed for robust 
detection of interactions. 

The parameter settings of the max-min filter and the scale-space transformation had 
considerable influence on the genes detected in our method. Several factors can account 
for this. Firstly, transcription factors are expressed at a low level, resulting in a higher 
variation in expression due to the inherent noise of microarray experiments. The 
max-min filter and the scale-space transformation both smooth these smaller variations, 
resulting in a trade-off between noise suppression and sensitivity. Secondly, when a 
higher value for σ² is used in the scale-space transformation, a bias is introduced which 
can alter the timing of regulatory events. Finally, controlled regulations occurring within 
a time interval of seven minutes (the sampling time in the experiment) will be detected 
as co-regulations. Within the limits of the experimental set-up, we cannot catch these 
regulatory events, a shortcoming that could be circumvented by sampling at shorter 
time intervals. 

We identified ten controlled regulatory events in the set of genes we analyzed, of 
which only one correlation turned out to be spurious. Two additional co-regulated genes 
(C(CDC28, CDC47), C(MCM1, FUS1)) may represent controlled regulations 
characterized by shorter lags than the sampling time. It is interesting to note that the expression 
of these genes was distant in cluster analysis, a consequence of the expression ratios 
being inverted in sign, yet co-regulated. An example of the latter was C(MCM1, FUS1): 
the high frequency and regular spacing of extrema led us to conclude that the 
detected correlation was due to co-linearity, because a controlled stimulatory interaction 
is expected to show a lagged co-regulation with extrema being of the same sign. In the 
future we will include prior knowledge, such as whether a regulatory gene stimulates 
or represses transcription of a target gene, to circumvent this problem. 

In summary, we present a proof of concept for a new method to extract regulatory 
interactions from microarray time series data. Despite the noisy character of the data 
and other experimental limitations, ten out of thirteen detected control-regulatory events 
corresponded to published experimental data, whereas one of the three false positives 
can be corrected using prior knowledge. Future approaches that incorporate knowledge 
about biological systems are expected to yield an even higher predictive accuracy. 

6 Conclusion 

In this article, we introduced a completely new approach to discovering putative regu- 
lative relations between genes studied in time series microarray experiments. The pre- 
processing steps make it possible to capture dynamic relations between sets of genes. 
Using Markov Chain Monte Carlo sampling and a new hierarchical Bayesian model, 
we discover control regulations and co-regulations between sets of genes. The method 
works well as it results in small compact graphs that reflect experimentally verified reg- 
ulatory relations between genes. The predictive variables included in the (second) most 
likely graphs often exert control upon the target gene. Among 15 regulatory relations 
found, only 2 were spurious. In the future, we will evaluate our approach further on 
simulation data to get more insight into the parameter settings and on other real data 
sets to validate the method’s appropriateness. 
Abstract. Many real-world machine learning tasks are faced with the 
problem of small training sets. Additionally, the class distribution of the 
training set often does not match the target distribution. In this paper 
we compare the performance of many learning models on a substantial 
benchmark of binary text classification tasks having small training sets. 
We vary the training size and class distribution to examine the learning 
surface, as opposed to the traditional learning curve. The models tested 
include various feature selection methods each coupled with four learning 
algorithms: Support Vector Machines (SVM), Logistic Regression, Naive 
Bayes, and Multinomial Naive Bayes. Different models excel in different 
regions of the learning surface, leading to meta-knowledge about which 
to apply in different situations. This helps guide the researcher and prac- 
titioner when facing choices of model and feature selection methods in, 
for example, information retrieval settings and others. 



1 Motivation and Scope 

Our goal is to advance the state of meta-knowledge about selecting which learn- 
ing models to apply in which situations. Consider these four motivations: 

1. Information Retrieval: Suppose you are building an advanced search 
interface. As the user sifts through each page of ten search results, it trains 
a classifier on the fly to provide a ranking of the remaining results based on 
the user’s positive or negative indication on each result shown thus far. Which 
learning model should you implement to provide greatest precision under the 
conditions of little training data and a markedly skewed class distribution? 

2. Semi-supervised Learning: When learning from small training sets, it is 
natural to try to leverage the many unlabeled examples. A common first phase 
in such algorithms is to train an initial classifier with the little data available and 
apply it to select additional predicted-positive examples and predicted-negative 
examples from the unlabeled data to augment the training set before learning 
the final classifier (e.g. [1]). With a poor choice for the initial learning model, 
the augmented examples will pollute the training set. Which learning model is 
most appropriate for the initial classifier? 



J.-F. Boulicaut et al. (Eds.): PKDD 2004, LNAI 3202, pp. 161-172, 2004. 
(c) Springer-Verlag Berlin Heidelberg 2004 
Table 1. Summary of test conditions we vary. 

P = 1..40        Positives in training set 
N = 1..200       Negatives in training set 
FX = 10..1000    Features selected 

Feature Selection Metrics: 
IG     Information Gain 
BNS    Bi-Normal Separation 

Performance Metrics: 
TP10         True positives in top 10 
TN100        True negatives in bottom 100 
F-measure    2 × precision × recall ÷ (precision + recall) 
             (harmonic avg. of precision & recall) 

Learning Algorithms: 
NB       Naive Bayes 
Multi    Multinomial Naive Bayes 
Log      Logistic Regression 
SVM      Support Vector Machine 


3. Real-World Training Sets: In many real-world projects the training set 
must be built up from scratch over time. The period where there are only a 
few training examples is especially long if there are many classes, e.g. 30-500. 
Ideally, one would like to be able to train the most effective classifiers at any point. 
Which methods are most effective with little training? 

4. Meta-knowledge: Testing all learning models on each new classification 
task at hand is an agnostic and inefficient route to building high quality clas- 
sifiers. The research literature must continue to strive to give guidance to the 
practitioner as to which (few) models are most appropriate in which situations. 
Furthermore, in the common situation where there is a shortage of training data, 
cross-validation for model selection can be inappropriate and will likely lead to 
over-fitting. Instead, one may follow the a priori guidance of studies demonstrat- 
ing that some learning models are superior to others over large benchmarks. 

In order to provide such guidance, we compare the performance of many 
learning models (4 induction algorithms × feature selection variants) on a 
benchmark of hundreds of binary text classification tasks drawn from various 
benchmark databases, e.g., Reuters, TREC, and OHSUMED. 

To suit the real-world situations we have encountered in industrial practice, 
we focus on tasks with small training sets and a small proportion of positives in 
the test distribution. Note that in many situations, esp. information retrieval or 
fault detection, the ratio of positives and negatives provided in the training set is 
unlikely to match the target distribution. And so, rather than explore a learning 
curve with matching distributions, we explore the entire learning surface, varying 
the number of positives and negatives in the training set independently of each 
other (from 1 to 40 positives and 1 to 200 negatives). This contrasts with most 
machine learning research, which tests under conditions of (stratified) cross- 
validation or random test/train splits, preserving the distribution. 

The learning models we evaluate are the cross product of four popular learn- 
ing algorithms (Support Vector Machines, Logistic Regression, Naive Bayes, 
Multinomial Naive Bayes), two highly successful feature selection metrics (In- 
formation Gain, Bi-Normal Separation) and seven settings for the number of 
top-ranked features to select, varying from 10 to 1000. 
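
The size of this model space is easy to enumerate. In the Python sketch below, the abbreviations follow Table 1, whereas the seven individual feature counts are hypothetical placeholders: the text only states that seven settings between 10 and 1000 are used.

```python
from itertools import product

ALGORITHMS = ("SVM", "Log", "NB", "Multi")
SELECTION_METRICS = ("IG", "BNS")
# Hypothetical values: only the range 10..1000 and the count (seven) are given.
FEATURE_COUNTS = (10, 20, 50, 100, 200, 500, 1000)


def learning_models():
    """All (algorithm, feature selection metric, number of features) triples."""
    return list(product(ALGORITHMS, SELECTION_METRICS, FEATURE_COUNTS))


models = learning_models()  # 4 x 2 x 7 = 56 learning models
```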

Learning from Little: Comparison of Classifiers Given Little Training 

We examine the results from several perspectives: precision in the top-ranked 
items, precision for the negative class in the bottom-ranked items, and F-measure, 
each being appropriate for different situations. For each perspective, we 
determine which models consistently perform well under varying amounts of 
training data. For example, Multinomial Naive Bayes coupled with feature 
selection via Bi-Normal Separation can be closely competitive with SVMs for 
precision, performing significantly better when there is a scarcity of positive training 
examples. 
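
The three perspectives correspond to the performance metrics of Table 1 and can be sketched in Python as follows. The function names and the convention that a higher score means "more likely positive" are assumptions made for illustration.

```python
def tp_at_top(scores, labels, k=10):
    """TP10-style metric: number of true positives among the k highest scores."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return sum(1 for _, is_positive in ranked[:k] if is_positive)


def tn_at_bottom(scores, labels, k=100):
    """TN100-style metric: number of true negatives among the k lowest scores."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0])
    return sum(1 for _, is_positive in ranked[:k] if not is_positive)


def f_measure(predicted, labels):
    """Harmonic average of precision and recall for binary predictions."""
    tp = sum(1 for p, y in zip(predicted, labels) if p and y)
    fp = sum(1 for p, y in zip(predicted, labels) if p and not y)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


scores = [0.9, 0.8, 0.3, 0.1]
labels = [True, False, True, False]
top = tp_at_top(scores, labels, k=2)
bottom = tn_at_bottom(scores, labels, k=2)
f = f_measure([s > 0.5 for s in scores], labels)
```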

The rest of the paper is organized as follows. The remainder of this section 
puts this study in the context of related work. Section 2 details the experiment 
protocol. Section 3 gives highlights of the results with discussion. Section 4 concludes 
with implications and future directions. 



1.1 Related Work 

There have been numerous controlled benchmark studies on the choice of feature 
selection (e.g. [2]) and learning algorithms (e.g. [3]). Here, we study the cross- 
product of the two together, and the results bear out that different algorithms 
call for different feature selection methods in different circumstances. Further, 
our study examines the results both for maximizing F-measure and for maximiz- 
ing precision in the top (or bottom) ranked items - metrics used in information 
retrieval and recommenders. 

A great deal of research is based on 5-fold or 10-fold cross-validation, which 
repeatedly trains on 80-90% of the benchmark dataset. In contrast, our work 
focuses on learning from very small training sets - a phase most training sets go 
through as they are being built up. Related work by [4] shows that Naive Bayes 
often surpasses Logistic Regression in this region for UCI data sets. We extend 
these results to the text domain and to other learning models. 

We vary the number of positives and negatives in the training set as inde- 
pendent variables, and examine the learning surface of the performance for each 
model. This is most akin to the work by [5], in which they studied the effect 
of varying the training distribution and size (an equivalent parameterization to 
ours) for the C4.5 decision tree model on a benchmark of UCI data sets, which 
are not in the text domain and do not require feature selection. They measured 
performance via accuracy and area under the ROC curve. We measure the per- 
formance at the middle (F-measure) and both extreme ends of the ROC curve 
(precision in the top/bottom scoring items) - metrics better focused on practi- 
cal application to information retrieval, routing, or semi-supervised learning. For 
example, when examining search engine results, one cares about precision in the 
top displayed items more than the ROC ranking of all items in the database. 

The results of studies such as ours and [5] can be useful to guide research in 
developing methods for learning under greatly unbalanced class distributions, for 
which there is a great deal of work [6]. Common methods involve over-sampling
the minority class or under-sampling the majority class, thereby manipulating 
the class distribution in the training set to maximize performance. Our study 
elucidates the effect this can have on the learning surface for many learning 
models.

Table 2. Description of benchmark datasets.

Dataset           Cases   Features   Classes
Whizbang/Cora      1800       5171        36
OHSUMED/Oh0        1003       3182        10
OHSUMED/Oh5         918       3012        10
OHSUMED/Oh10       1050       3238        10
OHSUMED/Oh15        913       3100        10
OHSUMED/ohscal    11162      11465        10
Reuters/Re0        1504       2886        13
Reuters/Re1        1657       3758        25
WebACE/wap         1560       8460        20
TREC/fbis          2463       2000        17
TREC/La1           3204      31472         6
TREC/La2           3075      31472         6
TREC/tr11           414       6429         9
TREC/tr12           313       5804         8
TREC/tr21           336       7902         6
TREC/tr23           204       5832         6
TREC/tr31           927      10128         7
TREC/tr41           878       7454        10
TREC/tr45           690       8261        10



2 Experiment Protocol 

Here we describe how the study was conducted, and as space allows, why certain 
parameter choices were made. Table 1 shows the parameter settings we varied 
and defines the abbreviations we use hereafter. 

Learning Algorithms: We evaluate the learning algorithms listed in Table 1 
using the implementation and default parameters of the WEKA machine learning 
library (version 3.4) [7]. 

Naive Bayes is a simple generative model in which the features are assumed 
to be independent of each other given the class variable. Despite its unrealistic 
independence assumptions, it has been shown to be very successful in a variety of 
applications and settings (e.g. [8,9]). The multinomial variation of Naive Bayes 
was found by [10] to excel in text classification. 

Regularized logistic regression is a commonly used and successful discriminative
classifier in which the class posterior probability is estimated from the
training data using the logistic function [11]. The WEKA implementation is
a multinomial logistic regression model with a ridge estimator, believed to be
suitable for small training sets.

The Support Vector Machine, based on risk minimization principles, has
proven well suited to classifying high-dimensional data such as text [2, 12, 3].
We use a linear kernel and varied the complexity constant C; all of the results
presented in this paper are for C=1 (WEKA’s default value); a discussion of the
results when we varied C is given in the discussion section. (The WEKA v3.4
implementation returns effectively boolean output for two-class problems, so we
had to modify the code slightly to return an indication of the Euclidean distance
from the separating hyperplane. This was essential to get reasonable TP10 and
TN100 performance.)

Feature Selection: Feature selection is an important and often underestimated
component of the learning model, as it accounts for large variations in
performance. In this study we chose to use two feature selection metrics, namely
Information Gain (IG) and Bi-Normal Separation (BNS). We chose these two
based on a comparative study of a dozen feature-ranking metrics indicating that
these two are top performers for SVM; classifiers learned with features selected
via IG tended to have better precision, whereas BNS proved better for recall,
overall improving F-measure substantially [2]. We expected IG to be superior
for the goal of precision in the top ten.

Table 3. Experiment procedure.

For each of the 19 multi-class dataset files:
  For each of its classes C
      where there are >=50 positives (C) and >=250 negatives (not C):
    For each of 5 random split seeds:
      Randomly select 40 positives and 200 negatives for set MaxTrain,
          leaving the remaining cases in the testing set.
      For P = 1..40:
        For N = 1..200:
          Select as the training set the first P positives
              and the first N negatives from MaxTrain.
          // Task established. Model parameters follow. //
          For each feature selection metric IG, BNS:
            Rank the features according to the feature selection metric
                applied to the training set only.
            For FX = 10, 20, 50, 100, 200, 500, 1000:
              Select the top FX features.
              For each algorithm SVM, Log, NB, Multi:
                Train on the training set of P positives and N negatives.
                Score all items in the testing set.
                For each performance measure TP10, F-measure, TN100:
                  Measure performance.

Benchmark Data Sets: We used the prepared benchmark datasets available 
from [2, 13], which stem originally from benchmarks such as Reuters, TREC, and 
OHSUMED. They comprise 19 text datasets, each case assigned to a single class. 
From these we generate many binary classification tasks identifying one class as 
positive vs. all other classes. Over all such binary tasks, the median percentage 
of positives is ~5%. We use binary features, representing whether the (stemmed) 
word appears at all in the document. The data are described briefly in Table 2. 
For more details, see Appendix A of [2]. 

Experiment Procedure: The experiment procedure is given in Table 3 as
pseudo-code. Its execution consumed ~5 years of computation time, run on
hundreds of CPUs in the HP Utility Data Center. Overall there are 153 2-class 
data sets which are randomly split five times yielding 765 binary tasks for each 
P and N. The condition on the second loop ensures that there are 40 positives 
available for training plus at least 10 others in the test set (likewise, 200 negatives 
for training and at least 50 for testing). Importantly, feature selection depends 
only on the training set, and does not leak information from the test set. 

Because the order of items in MaxTrain is random, later selecting the first P
or N cases amounts to random selection; this also means that the performance
for (P=3,N=8) vs. (P=4,N=8) represents the same test set, and the same training
set with one additional positive added at random, i.e. they may be validly
interpreted as steps on a learning curve.

[Figure 1: line plot; x-axis FX: number of features selected (10, 20, 50, 100, 200, 500, 1000); legend: NB-BNS, NB-IG, Multi-BNS, Multi-IG, Log-BNS, Log-IG, SVM-BNS, SVM-IG.]

Fig. 1. Average TP10 performance for each learning model given P=5 positives and
N=200 negatives, varying the number of features selected FX. (For more readable color
graphs, see http://www.hpl.hp.com/techreports/2004/HPL-2004-19R1.html)
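The nested sampling scheme of the experiment procedure can be sketched roughly as follows. This is a minimal illustration in Python, not the code used in the study; the function names `make_task` and `training_set` and the toy label vector are assumptions introduced here.

```python
import random

def make_task(labels, seed, max_pos=40, max_neg=200):
    """Shuffle positives and negatives once, keep the first 40/200 as the
    MaxTrain pool, and leave every remaining case in a fixed test set."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    rng.shuffle(pos)
    rng.shuffle(neg)
    test = sorted(pos[max_pos:] + neg[max_neg:])
    return pos[:max_pos], neg[:max_neg], test

def training_set(pos_pool, neg_pool, p, n):
    """First p positives and first n negatives of the shuffled pool."""
    return pos_pool[:p] + neg_pool[:n]

# Toy class with exactly the minimum counts required (50 positives, 250 negatives).
labels = [1] * 50 + [0] * 250
pos_pool, neg_pool, test = make_task(labels, seed=0)
small = training_set(pos_pool, neg_pool, p=3, n=8)
larger = training_set(pos_pool, neg_pool, p=4, n=8)
```

Because the pool is shuffled once per seed, `larger` is `small` plus one randomly chosen positive, and the test set is identical for every (P, N) pair, which is what allows adjacent cells to be read as steps on a learning curve.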

Performance Metrics: We analyze the results independently for each of the 
following metrics: 

1. The TP10 metric is the number of true positives found in the 10 test cases 
that are predicted most strongly by the classifier to be positive. 

2. The TN100 metric is the number of true negatives found in the 100 test 
cases most predicted to be negative. Because of the rarity of positives in the 
benchmark tasks, scores in the upper 90’s are common (TN10 is nearly always 
10, hence the use of TN100). 

3. F-measure is the harmonic average of precision and recall for the positive 
class. It is superior to grading classifiers based on accuracy (or error rate) when 
the class distribution is skewed. 
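The three metrics can be sketched as follows. This is an illustrative Python version under the assumption that each test case has a real-valued score (higher meaning more positive), a 0/1 true label, and, for F-measure, a hard 0/1 prediction; the function names are introduced here and are not from the study.

```python
def tp10(scores, labels, k=10):
    """Number of true positives among the k cases scored most strongly positive."""
    ranked = sorted(zip(scores, labels), key=lambda sl: sl[0], reverse=True)
    return sum(y for _, y in ranked[:k])

def tn100(scores, labels, k=100):
    """Number of true negatives among the k cases scored most strongly negative."""
    ranked = sorted(zip(scores, labels), key=lambda sl: sl[0])
    return sum(1 - y for _, y in ranked[:k])

def f_measure(predicted, labels):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if p == 0 and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Tiny made-up example with a cutoff of k=2 instead of 10 or 100.
scores = [0.9, 0.8, 0.7, 0.2, 0.1]
labels = [1, 0, 1, 0, 0]
top = tp10(scores, labels, k=2)
bottom = tn100(scores, labels, k=2)
f = f_measure([1, 1, 0, 0, 0], labels)
```

Note that TP10 and TN100 depend only on the ranking induced by the scores, whereas F-measure depends on the hard classification decisions.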

Different performance metrics are appropriate in different circumstances. For 
recommendation systems and information retrieval settings, where results are 
displayed to users incrementally with the most relevant first, the metric TP10 
is most appropriate. It represents the precision of the first page of results dis- 
played. For information filtering or document routing, one cares about both the 
precision and the recall of the individual hard classification decisions taken by 
the classifier. F-measure is the metric of choice for considering both together. For 
semi-supervised learning settings where additional positive (negative) cases are 
sought in the unlabeled data to expand the training set, the appropriate metric 
to consider is TP10 (TN100). Maximum precision is called for in this situation, 
or else the heuristically extended training set will be polluted with noisy labels.

3 Experiment Results 

We begin by examining an example set of results for the average TP10 perfor- 
mance over all benchmark tasks, where the training set has P=5 positives and 
N=200 negatives. See Fig. 1. We vary the number of features selected along the
logarithmic x-axis.

We make several observations. First, two models rise above the others by
~15% here: Naive Bayes using IG with a few features, tied with Multinomial 
Naive Bayes using BNS with several hundred features. Second, note that using 
the opposite feature selection metric for each of these Bayes models hurts their 
performance substantially, as does using the ‘wrong’ number of features. This 
illustrates that the choices in feature selection and the induction algorithms are 
interdependent and are best studied together. Third, SVM, which is known for
performing well in text classification, is consistently inferior in this situation,
with positives greatly underrepresented in the training set, as we shall see.

Condensing FX Dimension: These results are for only a single value of
P and N. Given the high dimensionality of the results, we condense the FX
dimension hereafter, presenting only the best performance obtained over any 
choice for FX. Because this maximum is chosen based on the test results, this 
represents an upper bound on what could be achieved by a method that attempts 
to select an optimal value of FX based on the training set alone. Condensing 
the FX dimension allows us to expose the differences in performance depending 
on the learning algorithm and feature selection metric. Furthermore, for the 
practitioner, it is typically easy to vary the FX parameter, but harder to change 
the implemented algorithm or feature selection metric. 
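Condensing the FX dimension amounts to a maximum over FX for each model at a fixed (P, N). The sketch below assumes a hypothetical `results` dictionary keyed by (model, FX); the numbers are made up for illustration and are not results from the study.

```python
def condense_fx(results):
    """Map each model to (best score, FX achieving it), maximizing over FX.

    The maximum is taken over test-set performance, so it is an upper bound
    on what a procedure choosing FX from the training set alone could reach.
    """
    best = {}
    for (model, fx), score in results.items():
        if model not in best or score > best[model][0]:
            best[model] = (score, fx)
    return best

# Made-up average TP10 values for two models at one (P, N) setting.
results = {
    ("NB-IG", 10): 6.1, ("NB-IG", 1000): 4.8,
    ("Multi-BNS", 10): 4.0, ("Multi-BNS", 500): 6.0,
}
best = condense_fx(results)
```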

Visualization: With this simplification, we can vary the number of positives P 
and negatives N in the training set to derive a learning surface for each of the 8 
combinations of algorithm and feature selection metric. We begin by illustrating 
a 3-D perspective in Fig. 2a showing the learning surfaces for just three learning 
models: Multinomial Naive Bayes, SVM and Logistic Regression, each with BNS 
feature selection. The performance measure here is the average number of true 
positives identified in the top 10 (TP10). From this visualization we see that 
Log-BNS is dominated over the entire region, and that between the remaining 
two models, there are consistent regions where each performs best. In particular, 
the SVM model substantially underperforms Multinomial Naive Bayes when
there are very few positives. The two are competitive with many positives and 
negatives. 

This 3-D perspective visualization becomes difficult if we display the surfaces 
for each of the eight learning models. To resolve this, we plot all surfaces together 
and then view the plot from directly above, yielding the topo-map visualization 
shown in Fig. 2b. This reveals only the best performing model in each region. 
The visualization also indicates the absolute performance (z-axis) via isoclines, 
like a geographical topo-map. 
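Viewing all surfaces from above is equivalent to keeping, for each (P, N) cell, the model with the highest value. A minimal sketch follows, assuming each learning surface is stored as a dictionary from (P, N) to performance; the name `topo_map` and the values are hypothetical.

```python
def topo_map(surfaces):
    """For each (P, N) cell return (best model, its performance)."""
    winners = {}
    for model, surface in surfaces.items():
        for cell, score in surface.items():
            if cell not in winners or score > winners[cell][1]:
                winners[cell] = (model, score)
    return winners

# Made-up surfaces sampled at two cells only.
surfaces = {
    "Multi-BNS": {(5, 200): 6.0, (40, 200): 8.0},
    "SVM-BNS": {(5, 200): 4.5, (40, 200): 8.2},
}
winners = topo_map(surfaces)
```

The winning score per cell is what the isoclines trace; the map deliberately discards the margin over the runner-up.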

While a topo-map shows the model that performed best for each region, it 
does not make clear by how much it beat competing models. For this, we show 
two z-axis cross-sections of the map near its left and right edges. Figures 2c-d 
fix the number of negatives at N=10 and N=200, comparing the performance of 
all eight models as we vary the number of positives P. Recall that by design the 
test set for each benchmark task is fixed as we vary P and N, so that we may 
view these as learning curves as we add random positive training examples.

Fig. 2. TP10 performance. (a) Learning surfaces for three models (SVM-BNS, Multi-BNS
and Log-BNS) as we vary the number of positives and negatives in the training
set. (b) Topo-map of best models: 3-D surfaces of all models viewed from above. Isoclines
show identical TP10 performance. (c) Cross-section at N=10 negatives, varying P
positives. (d) Cross-section at N=200 negatives, varying P positives. (Legend in Fig. 1.)



TP10 Results: Our initial impetus for this study was to determine which
learning models yield the best precision in the top 10 given little training data.
We find that the answer varies as we vary the number of positives and negatives, 
but that there are consistent regions where certain models excel. This yields 
meta-knowledge about when to apply different classifiers. See Fig. 2b. We observe 
that BNS is generally the stronger feature selection metric, except roughly where 
the number of positives exceeds the number of negatives in the training set along 
the y-axis, where Multi-IG dominates. 

Recall that the test distribution has only a few percent positives, as is com- 
mon in many tasks. So, a random sample would fall in the region near the x-axis 
having <10% positives where Multi-BNS dominates. Observe by the horizontal 
isoclines in most of this region that there is little to no performance improve- 
ment for increasing the number of negatives. In this region, the best action one 
can take to rapidly improve performance is to provide more positive training 
examples (and similarly near the y-axis). 

The isoclines show that the best TP10 performance overall can be had by
providing a training set that has an overrepresentation of positives, say P>30
and N>100. Here, NB-BNS dominates, but we can see by the mottling of colors
in this region that Multi-BNS is closely competitive. More generally, the
cross-section views in Figs. 2c-d allow us to see how competitive the remaining
learning models are. In Fig. 2c, we can now see that with enough positives,
Multi-BNS, SVM-BNS and NB-BNS are all competitive, but in the region with <=15
positives, Multi-BNS stands out substantially.

Fig. 3. TN100 performance. (a) Topo-map of best models for TN100. (b) Cross-section
given N=200 negatives, varying P positives. (Legend in Fig. 1.)

TN100 Results: We also determined which learning model yields the most
true negatives in the bottom-ranked list of 100 test items. Although no model
is dominant everywhere, NB-IG is a consistent performer, as shown in Fig. 3a, 
especially with many negatives. In the N=200 cross-section shown in Fig. 3b, 
we see that it and SVM-IG substantially outperform the other models. Overall, 
performance is very high (over 98% precision) - a fact that is not surprising 
considering that there are many more negatives than positives in the test sets. 
Unfortunately, no further performance improvement is attained after ~20 posi- 
tives (though one may speculate for P>40). 

F-Measure Results: Next we compare the learning models by F-measure. Fig- 
ure 4a shows the topo-map of the best performing models over various regions. 
Observing the isoclines, the greatest performance achieved is by SVM-BNS, with 
appropriate oversampling of positives. If we sample randomly from the test
distribution, most labels found will be negative, putting us in the region of poor
F-measure performance along the x-axis, where NB-IG dominates. We see in
the cross-section in Fig. 4b with P=5 fixed and varying the number of negatives 
N, that NB-IG dominates all other models by a wide margin and that its per- 
formance plateaus for N>=30 whereas most other models experience declining 
performance with increasing negatives. Likewise, in Fig. 4c with N=10 fixed, the 
performance of all models declines as we increase the number of positives. Finally,
in Fig. 4d with N=200 fixed, we see that substantial gains are had by SVM
with either feature selection metric as we obtain many positive training exam- 
ples. With few positives, such as obtained by random sampling, SVM becomes 
greatly inferior to NB-IG or Multi-BNS.

Fig. 4. F-measure performance. (a) Topo-map of best models for F-measure. (b) Cross-section
at P=5 positives, varying N negatives. (c) Cross-section at N=10 negatives,
varying P positives. (d) Cross-section at N=200 negatives, varying P positives. (Legend
in Fig. 1.)



Observing the isoclines in Fig. 4a, the best approach for maximizing F-mea- 
sure at all times while building a training set incrementally from scratch is to 
use SVM (-BNS or else -IG) and keep the class distribution of the training set 
at roughly 20% positives by some non-random sampling method. 

3.1 Discussion 

Generalizing from the notion of a learning curve to a learning surface proves to 
be a useful tool for gaining insightful meta-knowledge about regions of classifier 
performance. We saw that particular classifiers excel in different regions; this 
may be constructive advice to practitioners who know in which region they are 
operating. The learning surface results also highlight that performance can be 
greatly improved by non-random sampling that somewhat favors the minority 
class on tasks with skewed class distributions (however, balancing P=N is unfa- 
vorable). This is practical for many real-world situations, and may substantially 
reduce training costs to obtain satisfactory performance. 

Because of the ubiquitous research practices of random sampling and cross- 
validation, however, researchers routinely work with a training set that matches 
the distribution of the test set. This can mask the full potential of classifiers 
under study. Furthermore, it hides from view the research opportunity to develop
classifiers that are less sensitive to the training distribution. This would be useful
practically since in industrial classification problems the class distribution of the 
training set is often varied, unknown in advance, and does not match the testing 
or target distributions, which may vary over time. 

Naive Bayes models have an explicit parameter reflecting the class distribution,
which is usually set to the distribution of the training set. Hence, these
models are frequently said to be sensitive to the training distribution. The em- 
pirical evidence for TP10, TN100 and F-measure shows that Naive Bayes models 
are often relatively insensitive to a shift in training distribution (consistent with 
the theoretical results by Elkan [14]), and surpass SVM when there is a shortage 
of positives or negatives. 

Although the results showed that SVM excels overall for TP10 and F-measure 
if the training class distribution is ideal, SVM proves to be highly sensitive to the 
training distribution. This is surprising given that SVM is popularly believed to 
be resilient to variations in class distribution by its discriminative nature rather 
than being density-based. 

This raises the question of varying SVM’s C parameter to try to reduce 
its sensitivity to the training class distribution. To study this, we replicated 
the entire experiment protocol for eight values of C ranging from 0.001 to 5.0. 
For F-measure, other values of C substantially hurt SVM’s performance in the 
regions in Fig. 4a along the x- and y-axes where SVM was surpassed by other 
models. In the large region where SVM already dominates, however, some values 
of C increased performance - at the cost of making the learning surface more 
sensitive to the training distribution. For P=40 and N=200, F-measure can be 
increased by as much as ~5% when C=0.1, but then its performance declines 
drastically if the number of positives is reduced below 20. 

The additional results were similar for TP10. Refer to the topo-map in Fig. 2b.
No value of C made SVM surpass the performance of Multi-BNS in the region 
along the x-axis. At the upper right, performance could be increased by fortunate 
choices for C (by as much as ~3% for P=40 and N=200, which exceeds the 
performance of NB-BNS slightly). This boost comes at the cost of very poor 
performance in the lower region along the x-axis. Finally, some values of C made 
SVM competitive with Multi-IG in the top left, but again made performance 
much worse as we decrease the number of positives. 

Overall, varying C does not lead to fundamentally different conclusions about 
the regions of performance. It does not address the issue of making the choice 
of classifier insensitive to the operating region. Furthermore, in regions where 
performance can be improved, it remains to be seen whether the optimal C 
value can be determined automatically via cross-validation. With small training 
sets, such cross-validation may only lead to overfitting the training set, without
practical improvement. 

4 Summary 

This paper compared the performance of different classifiers with settings often 
encountered in real situations: small training sets especially scarce in positive
examples, different test and train class distributions, and skewed distributions
of positive and negative examples. Visualizing the performance of the different 
classifiers using the learning surfaces provides meta-information about which 
models are consistent performers under which conditions. The results showed 
that feature selection should not be decoupled from the model selection task, as 
different combinations are best for different regions of the learning surface. 

Future work potentially includes expanding the parameters, classifiers and 
datasets studied, and validating that the meta-knowledge successfully transfers 
to other text (or non-text) classification tasks.

Abstract. In this paper we introduce a simple probabilistic model, hierarchical
tiles, for 0-1 data. A basic tile (X, Y, p) specifies a subset X of the rows and a
subset Y of the columns of the data, i.e., a rectangle, and gives a probability p
for the occurrence of 1s in the cells of X × Y. A hierarchical tile additionally has
a set of exception tiles that specify the probabilities for subrectangles of the
original rectangle. If the rows and columns are ordered and X and Y consist of 
consecutive elements in those orderings, then the tile is geometric; otherwise it 
is combinatorial. We give a simple randomized algorithm for finding good geo- 
metric tiles. Our main result shows that using spectral ordering techniques one 
can find good orderings that turn combinatorial tiles into geometric tiles. We give 
empirical results on the performance of the methods. 



1 Introduction 

The analysis of large 0-1 data sets is an important area in data mining. Several tech- 
niques have been developed for analysing and understanding binary data; association 
rules [3] and clustering [15] are among the most well-studied. Typical problems with
association rules are that the correlation between items is defined with respect to
arbitrarily chosen thresholds, and that the large size of the output makes the results difficult
to interpret. On the other hand, clustering algorithms define distances between points 
with respect to all data dimensions, making it possible to ignore correlations among 
subsets of dimensions - an issue that has been addressed with subspace-clustering ap- 
proaches [1,2, 7, 8, 12]. 

One of the crucial issues in data analysis is finding good and understandable models 
for the data. In the analysis of 0-1 data sets, one of the key questions can be formulated 
simply as “Where are the ones?”. That is, one would like to have a simple, understand- 
able, and reasonably accurate description of where the ones (or zeros) in the data occur. 

We introduce a simple probabilistic model, hierarchical tiles, for 0-1 data. Informally,
the model is as follows. A basic tile $\tau = (X, Y, p)$ specifies a subset $X$ of the
rows and a subset $Y$ of the columns of the data, i.e., a rectangle, and gives a probability
$p$ for the occurrence of 1 in the cells of $X \times Y$. A hierarchical tile $\tau$ consists of a basic
tile plus a set of exception tiles, i.e., $\tau = (\tau_0, \{\tau_1, \ldots, \tau_k\})$, where each $\tau_i$ is a tile. The
tiles $\tau_1, \ldots, \tau_k$ are assumed to be defined on disjoint subrectangles of $\tau_0$. For an
illustrative example, actually computed by our algorithm on one of our real data sets, see
Figure 1. Given a point $(x, y) \in X \times Y$, the tile $\tau$ predicts the probability associated
with $\tau_0$, unless $(x, y)$ belongs to the subset defined by an exception tile $\tau_i$, for some
$i \geq 1$; in this case the prediction is the prediction given by that particular $\tau_i$. Thus a
hierarchical tile for which $\tau_0$ covers the whole set $X \times Y$ defines a probability model
for the set (footnote 1).

Fig. 1. Hierarchical tiling obtained for one of the data sets, Paleo2. The darkness of each
rectangle depicts the associated probability.

There are two types of tiles. If the rows and columns are ordered and X and Y
are ranges on those orderings, then the tile is geometric; if X and Y are arbitrary
subsets, then the tile is combinatorial. Given a data set with n rows and m columns, there
are $O(n^2 m^2)$ possible geometric basic tiles, but $O(2^n 2^m)$ possible combinatorial
basic tiles. Thus combinatorial tiles are a much stronger concept, and finding the best
combinatorial tiles is much harder than finding the best geometric tiles.

In this paper we first give a simple randomized algorithm for finding geometric tiles. 
We show that the algorithm finds with high probability the tiles in the data. We then 
move to the question of finding combinatorial tiles. Our main tool is spectral ordering, 
based on eigenvector techniques [9]. We prove that using spectral ordering methods one 
can find orderings on which good combinatorial tiles become geometric. We evaluate 
the algorithms on real data, and indicate how the tiling model gives accurate and inter- 
pretable results. The rest of the paper is organized as follows. In Section 2 we define 
formally the problem of hierarchical tiling, and in Section 3 we describe our algorithms. 
We present our experiments in Section 4, and in Section 5 we discuss the related work. 
Finally, Section 6 is a short conclusion. 

2 Problem Description 

The input to the problem consists of a 0-1 data matrix A with m rows R and n columns 
C. For row i and column j, the (i,j) entry of A is denoted by A(i,j).

Rectangles. As we already mentioned, we distinguish between combinatorial and geometric
rectangles. A combinatorial rectangle $r_c(A, X, Y)$ of the matrix $A$, defined for
a subset of rows $X \subseteq R$ and a subset of columns $Y \subseteq C$, is a submatrix of $A$ on $X$
and $Y$. Geometric rectangles are defined assuming that the rows $R$ and columns $C$ of
$A$ are ordered. We denote such an ordering by $R = (r_1, \ldots, r_m)$ with $r_1 < \cdots < r_m$,
where “<” is an ordering relationship. Given an ordering on $R$, a range $X$ of $R$ is a
subset of consecutive rows of $R$. A geometric rectangle $r_g(A, X, Y)$ is now defined as
the submatrix of $A$ over the rows $X$ and columns $Y$, where $X$ and $Y$ are ranges of $R$
and $C$, respectively.

Footnote 1. Our model can easily be extended to the case where each basic tile has a
probability parameter for each column in $Y$; this leads the model in the direction of
subspace clustering. For simplicity of exposition we use the formulation of one parameter
per basic tile.

Tiles. To make the definition of hierarchical tiles noncircular we use the concept of
level, which tells how deep the nesting is. Given the data matrix $A$, a basic tile, or
level-0 tile, $\tau^0$ is a rectangle $r$ of $A$ with an associated probability $p$, i.e., $\tau^0 = (r, p)$.
Entries of $A$ inside the rectangle $r$ take value 1 with probability $p$ and value 0 with
probability $1 - p$. A level-$k$ tile $\tau^k$ consists of a basic tile and a set of exception tiles;
the exception tiles are of level at most $k - 1$. We write $\tau^k = (\tau^0, \{\tau_1, \ldots, \tau_m\})$, where
$\tau^0 = (r, p)$ and each $\tau_i$ is a tile of level at most $k - 1$. We require that the exception
tiles $\tau_1, \ldots, \tau_m$ are disjoint and that they are contained in $\tau^0$. Finally, with each tile we
associate a domain. The domain $\mathrm{Dom}(\tau^0)$ of a basic tile $\tau^0 = (r, p)$ is the rectangle $r$.
The domain $\mathrm{Dom}(\tau^k)$ of a level-$k$ tile $\tau^k = (\tau^0, \{\tau_1, \ldots, \tau_m\})$ is the domain of $\tau^0$.

Prediction and likelihood. Given a position $(i, j)$ in the data matrix $A$, the prediction
$q(\tau^k, i, j)$ of a tile $\tau^k$ for $(i, j)$ is defined recursively. For a basic tile $\tau^0 = (r, p)$ the
prediction $q(\tau^0, i, j)$ is $p$ (a basic tile predicts what it says). If $\tau^k = ((r, p), \{\tau_1, \ldots, \tau_m\})$,
then $q(\tau^k, i, j) = p$ if $(i, j) \notin \bigcup_t \mathrm{Dom}(\tau_t)$ (i.e., if $(i, j)$ is outside all exception tiles
of $\tau^k$). Otherwise, let $t$ be the index such that $(i, j) \in \mathrm{Dom}(\tau_t)$; then
$q(\tau^k, i, j) = q(\tau_t, i, j)$ (the prediction of the tile is the prediction of its exception that
contains $(i, j)$). Let $A(r) = \{A(i, j) \mid (i, j) \in r\}$ be the restriction of the data matrix $A$ to
the rectangle $r$. Given a tile $\tau^k = ((r, p), \{\tau_1, \ldots, \tau_m\})$, the likelihood of the data $A(r)$
given $\tau^k$ is defined in the normal way:

$$L(A(r) \mid \tau^k) = \prod_{i,j} q(\tau^k, i, j)^{A(i,j)} \, \bigl(1 - q(\tau^k, i, j)\bigr)^{1 - A(i,j)}.$$
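The recursive prediction and the likelihood can be illustrated with a short Python sketch. The `Tile` class, its field names, and the toy matrix are assumptions made for this illustration and are not the authors' implementation; the likelihood is computed in log form for numerical convenience.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Tile:
    rows: frozenset            # row indices of the tile's rectangle (its domain)
    cols: frozenset            # column indices of the tile's rectangle
    p: float                   # probability of a 1 outside all exception tiles
    exceptions: list = field(default_factory=list)  # disjoint nested tiles

    def contains(self, i, j):
        return i in self.rows and j in self.cols

    def predict(self, i, j):
        """q(tile, i, j): defer to the exception containing (i, j), else p."""
        for t in self.exceptions:
            if t.contains(i, j):
                return t.predict(i, j)
        return self.p

def _safe_log(x):
    return math.log(x) if x > 0 else float("-inf")

def log_likelihood(A, tile):
    """log L(A(r) | tile), summed over all cells of the tile's domain."""
    total = 0.0
    for i in tile.rows:
        for j in tile.cols:
            q = tile.predict(i, j)
            total += _safe_log(q) if A[i][j] == 1 else _safe_log(1.0 - q)
    return total

# Toy 2x2 matrix: a sparse background tile with one dense exception cell.
A = [[1, 0], [0, 0]]
root = Tile(rows=frozenset({0, 1}), cols=frozenset({0, 1}), p=0.25,
            exceptions=[Tile(rows=frozenset({0}), cols=frozenset({0}), p=0.9)])
ll = log_likelihood(A, root)
```

Because the exception tiles are required to be disjoint, at most one of them contains a given cell, so the order in which they are checked does not matter.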

Hierarchical tiling problem. The problem of finding hierarchical tiles that explain the
data matrix as well as possible can now be formulated as finding the tile $\tau = ((A, p),
\{\tau_1, \ldots, \tau_m\})$ that maximizes the likelihood $L(A \mid \tau)$. However, in order to avoid
overfitting the data (very complex tiles that fit the data perfectly, e.g., using tiles at the
level of single matrix entries) one needs to penalize solutions with high complexity.
Using Minimum Description Length (MDL) arguments, we define the score of $A(r)$
with respect to $\tau$ as $s(A(r) \mid \tau) = cK - \log L(A(r) \mid \tau)$, where $K$ is a measure of the
total complexity of $\tau$, and $c$ is a scaling constant between complexity and minus log-
likelihood. The complexity measure $K$ is a function of the total number of tiles in $\tau$,
counting $\tau$ itself, its exceptions, the exceptions of its exceptions, and so on. If we
denote the total number of tiles of $\tau$ by $|\tau|$ then $K$ is defined to be $K = |\tau| \log |A(r)|$.
The factor $\log |A(r)|$ is due to the fact that as the size of the data grows we need more
information bits to specify the tiles, which amounts to more complex models. The problem
of finding hierarchical tiles can now be defined as follows: Given data matrix $A$, find
the tile $\tau$ with the lowest score $s(A \mid \tau)$. The tile $\tau$ can be of any level, but it is required
that $\mathrm{Dom}(\tau) = A$, i.e., it should cover the whole data matrix.
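
The score can be sketched as follows. This is a minimal illustration under stated assumptions: a tile is written as a nested tuple `(rect, p, exceptions)`, $|A(r)|$ is read as the number of entries of the rectangle, natural logarithms are used, and the log-likelihood is supplied by the caller. None of these choices is fixed by the text above.

```python
import math


def count_tiles(tile):
    """Total number of tiles: the tile itself plus all nested exceptions."""
    _rect, _p, exceptions = tile
    return 1 + sum(count_tiles(t) for t in exceptions)


def mdl_score(tile, log_lik, n_entries, c=1.0):
    """Score s = c*K - log L with K = |tile| * log(n_entries).

    Assumptions of this sketch: n_entries stands for |A(r)| (the number of
    matrix entries in the domain) and all logarithms are natural.
    """
    K = count_tiles(tile) * math.log(n_entries)
    return c * K - log_lik
```

On a 2 x 4 matrix made of a block of four 1s next to four 0s, a two-tile hierarchy with log-likelihood 0 obtains a lower (better) score than a single tile with probability 0.5, because the gain in likelihood outweighs the cost of the extra tile.
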
3 Algorithms

3.1 Geometric Tiles 

We start describing our algorithm for discovering geometric tiles by first considering
very simple cases, and then we discuss how to extend the ideas to more complex
situations. The simplest case is finding only one tile. Given a specific
geometric rectangle $r = (A, (a, b), (c, d))$ of $A$ (rows $a, \ldots, b$ and columns $c, \ldots, d$)
to be used as the domain of the tile, the only choice to be made is the tile probability.
As one can easily see, the maximum likelihood estimate for the tile probability is the
frequency $f(r)$ of ones in $r$. The frequency $f(r)$ can be computed in constant time,
assuming that cumulative sums have been precomputed for all entries of the matrix:
if $A_c(i, j)$ denotes the number of 1s inside the rectangle $(A, (1, i), (1, j))$ then

$$f(r) = \frac{A_c(b, d) - A_c(b, c - 1) - A_c(a - 1, d) + A_c(a - 1, c - 1)}{(b - a + 1)(d - c + 1)}.$$

When the domain of the tile is not given, in principle one can try all possible rectangles
$r$, evaluate the likelihood $L(A(r) \mid \tau)$ of the basic tile $\tau = (r, f(r))$ for each $r$, and
select the tile that maximizes the likelihood. However, considering all rectangles is
prohibitively expensive, since there are $O(m^2 n^2)$ different choices.
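
The constant-time frequency computation can be sketched as below. The function names are assumptions of this illustration. Rows and columns are 1-based as in the formula, and the table of cumulative sums carries an extra zero row and zero column so that the terms with index $a - 1$ or $c - 1$ need no special handling.

```python
def cumulative_sums(A):
    """Ac[i][j] = number of 1s in rows 1..i and columns 1..j (1-based);
    row 0 and column 0 are zero padding."""
    m, n = len(A), len(A[0])
    Ac = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            Ac[i][j] = (A[i - 1][j - 1] + Ac[i - 1][j]
                        + Ac[i][j - 1] - Ac[i - 1][j - 1])
    return Ac


def frequency(Ac, a, b, c, d):
    """Frequency of 1s in rows a..b and columns c..d, in constant time."""
    ones = Ac[b][d] - Ac[b][c - 1] - Ac[a - 1][d] + Ac[a - 1][c - 1]
    return ones / ((b - a + 1) * (d - c + 1))
```

Building the table takes time proportional to the size of the matrix; afterwards every rectangle query costs four table lookups.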

Designing an efficient algorithm to find a tile whose likelihood is provably not much
worse than the likelihood of the best tile is a very interesting problem. However, it
appears quite challenging: the likelihood function is not monotone with respect to tile
containment, so there are no obvious ways to prune away potential tiles. Here we
suggest a local-search algorithm for finding good tiles. The idea is to start with a random
rectangle, and try to expand it or shrink it in each of the four directions. Expanding a
rectangle $r_0$ in one direction, say to the right, is done in a sequence of geometric steps:
for $r_0 = ((a, b), (c, d))$, we try all rectangles $((a, b), (c, d + 1))$, $((a, b), (c, d + 2))$,
$((a, b), (c, d + 4))$, and so on, until the right boundary of the matrix is reached. The
same expansion technique is performed for the other directions, and shrinking is done in a
similar way. Out of all rectangles tried, the one with the largest likelihood is selected;
call it $r_1$. If the likelihood of $r_1$ is larger than the likelihood of $r_0$, then a new
expansion/shrinking phase starts from $r_1$. The process continues until a rectangle is found
whose likelihood does not increase in an expansion/shrinking phase. A total of $T$
random trials with different starting rectangles $r_0$ is performed, and the rectangle with the
largest likelihood over all trials is given as the result.
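
A compact sketch of this local search follows. Several details are assumptions of the sketch rather than statements of the paper: a candidate rectangle is scored by the log-likelihood of the whole matrix under a two-part model (maximum-likelihood probability inside the rectangle, a separate one outside), rectangles are tuples `(a, b, c, d)` with 1-based inclusive bounds, and all function names are hypothetical.

```python
import math
import random


def prefix_sums(A):
    m, n = len(A), len(A[0])
    Ac = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            Ac[i][j] = A[i - 1][j - 1] + Ac[i - 1][j] + Ac[i][j - 1] - Ac[i - 1][j - 1]
    return Ac


def bernoulli_ll(ones, size):
    """Maximised Bernoulli log-likelihood of `size` bits with `ones` 1s."""
    if size == 0 or ones == 0 or ones == size:
        return 0.0
    f = ones / size
    return ones * math.log(f) + (size - ones) * math.log(1.0 - f)


def score(Ac, m, n, rect):
    """Assumed objective: log-likelihood of the matrix, inside plus outside."""
    a, b, c, d = rect
    inside = Ac[b][d] - Ac[b][c - 1] - Ac[a - 1][d] + Ac[a - 1][c - 1]
    size = (b - a + 1) * (d - c + 1)
    return bernoulli_ll(inside, size) + bernoulli_ll(Ac[m][n] - inside, m * n - size)


def neighbours(rect, m, n):
    """Move one side outward or inward by 1, 2, 4, ... positions."""
    a, b, c, d = rect
    step = 1
    while step < max(m, n):
        for a2, b2, c2, d2 in ((a - step, b, c, d), (a + step, b, c, d),
                               (a, b - step, c, d), (a, b + step, c, d),
                               (a, b, c - step, d), (a, b, c + step, d),
                               (a, b, c, d - step), (a, b, c, d + step)):
            if 1 <= a2 <= b2 <= m and 1 <= c2 <= d2 <= n:
                yield (a2, b2, c2, d2)
        step *= 2


def climb(Ac, m, n, start):
    """Repeat expansion/shrinking phases until the score stops increasing."""
    cur, cur_score = start, score(Ac, m, n, start)
    while True:
        cands = list(neighbours(cur, m, n))
        if not cands:
            return cur, cur_score
        best = max(cands, key=lambda r: score(Ac, m, n, r))
        best_score = score(Ac, m, n, best)
        if best_score <= cur_score:
            return cur, cur_score
        cur, cur_score = best, best_score


def local_search(A, trials=20, seed=0):
    """Best rectangle over `trials` random starting rectangles."""
    rng = random.Random(seed)
    m, n = len(A), len(A[0])
    Ac = prefix_sums(A)
    best, best_score = None, -math.inf
    for _ in range(trials):
        a, b = sorted((rng.randint(1, m), rng.randint(1, m)))
        c, d = sorted((rng.randint(1, n), rng.randint(1, n)))
        rect, s = climb(Ac, m, n, (a, b, c, d))
        if s > best_score:
            best, best_score = rect, s
    return best, best_score
```

Each phase evaluates a number of candidates that is logarithmic in the matrix dimensions, and every candidate is scored in constant time from the cumulative sums. Since the score strictly increases from phase to phase, each climb terminates.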

Lemma 1. Assume that the data matrix $A$ contains i.i.d. bits with probability $q$, with
the exception of one geometric rectangle $R$, which contains i.i.d. bits with probability
$p \neq q$ and whose number of rows and columns is a constant fraction of the number of
rows and columns (resp.) of the matrix $A$. Then, the local search method with random
restarts will find $R$ with high probability, i.e., probability bounded away from zero in
the limit of infinite data.

Due to space limitations, the proofs of all our claims are deferred to the full version of the
paper.
Next we discuss how to find a larger collection of tiles with large likelihood. Our
method employs the algorithm for finding one tile in a greedy fashion: Find the $(k + 1)$-st
tile with the best likelihood, given the $k$ tiles that have been found so far. When
searching for the next best tile, tiles that overlap existing tiles are not considered. This
is checked during the expansion phase. To decide the number of tiles to be selected, the
MDL score function $s(A \mid \tau)$ is used. When $s(A \mid \tau)$ stops decreasing, no more tiles are
selected. For constructing the tile hierarchies, we have implemented and experimented
with four different strategies:

Top down: At each step, the next tile is selected to be only at the same level, or
included in already existing tiles.

Bottom up: The next tile is selected to be at the same level, or to include already 
existing tiles. 

Mixed: The next tile is allowed to be anywhere as long as it does not overlap with 
existing tiles. 

Single level: Only tiles of level 0 are selected. 

Notice that the search space of the mixed strategy is the union of the search spaces of 
the other strategies, thus, one would expect the mixed strategy to outperform the others. 
The single-level strategy finds non-hierarchical tilings. 
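
One possible reading of the four strategies is as a filter on candidate rectangles, sketched below. The encoding is an assumption of this illustration: rectangles are `(a, b, c, d)` tuples with inclusive bounds, `existing` holds the tiles found so far without the root tile that covers the whole matrix, and "overlap" means a partial intersection, or an exact duplicate, that is neither containment nor disjointness.

```python
def relation(r, s):
    """How rectangle r relates to rectangle s:
    'disjoint', 'inside' (r within s), 'contains' (r encloses s) or 'overlap'."""
    a, b, c, d = r
    e, f, g, h = s
    if r == s:
        return "overlap"  # an exact duplicate is never a useful new tile
    if b < e or f < a or d < g or h < c:
        return "disjoint"
    if e <= a and b <= f and g <= c and d <= h:
        return "inside"
    if a <= e and f <= b and c <= g and h <= d:
        return "contains"
    return "overlap"


# Relations to every existing tile that each strategy tolerates
# (an interpretation of the strategy descriptions, not the authors' code).
ALLOWED = {
    "top_down": {"disjoint", "inside"},
    "bottom_up": {"disjoint", "contains"},
    "mixed": {"disjoint", "inside", "contains"},
    "single_level": {"disjoint"},
}


def admissible(candidate, existing, strategy):
    """True if the candidate may be added under the given strategy."""
    return all(relation(candidate, s) in ALLOWED[strategy] for s in existing)
```

Under this reading the set of admissible candidates of the mixed strategy contains the admissible candidates of each of the other three strategies, which matches the remark above about the search spaces.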

3.2 Combinatorial Tiles 

In many applications, the rows and columns of the data set are not ordered, so it is 
important to be able to find combinatorial tiles. In this section we discuss our approach 
for this task. The basic idea is to transform the problem of finding combinatorial tiles 
to the previous case of finding geometric tiles. The transformation is done by ordering 
the rows and the columns of the data set, so that the rows and columns that are involved 
in combinatorial tiles become consecutive in the ordering. In this way, it is sufficient to 
search for geometric tiles in the ordered data set. 

As we will see, it is not always possible to find such an ordering, since a data set 
might contain too many combinatorial tiles, and no single ordering can simultaneously 
transform all of them into geometric tiles. However, we will show that if a good ordering 
exists, our method will find it. The ordering method is based on the spectral properties 
of the data set. We next give a brief overview of the spectral techniques [9]. 

Consider a set of objects $W$ and a symmetric matrix $S = (s_{ij})$ that specifies the
similarity $s_{ij}$ for each pair of objects $(i, j)$. The Laplacian matrix of $S$ is defined as the
symmetric and zero-sum matrix $L_S = D_S - S$, where $D_S$ is the diagonal matrix whose
$(i, i)$-th entry is the sum of the $i$-th row (or column) of $S$, that is, $d_{ii} = \sum_j s_{ij}$. Let $e$
be the vector having value 1 in all its entries. Since all rows (and columns) of $L_S$ sum
to zero, we have $L_S e = 0$, which means that $e$ is an eigenvector of $L_S$, corresponding
to the eigenvalue 0. Because $L_S$ is a symmetric positive semidefinite matrix, all of its
eigenvalues are real and nonnegative, and therefore 0 is the smallest eigenvalue. The
second smallest eigenvalue of $L_S$ is called the Fiedler value, and the corresponding
eigenvector is called the Fiedler vector [11]. One can show that the Fiedler value is given by

$$\min_{x^T e = 0,\; x^T x = 1} x^T L_S x \;=\; \min_{x^T e = 0,\; x^T x = 1} \sum_{i < j} s_{ij} (x_i - x_j)^2 \qquad (1)$$






and the Fiedler vector is a vector that achieves the minimum subject to the constraints
$x^T e = 0$ and $x^T x = 1$. A vector $x$ can be viewed as a mapping from objects in $W$ to
real numbers. In particular, the object $i$ in $W$ is mapped to the $i$-th coordinate $x_i$ of $x$.
If we view Equation (1) as an energy function $F_S(x) = x^T L_S x$, then the Fiedler vector
$v$ has the property that it minimizes $F_S(x)$ over all vectors $x$ that satisfy the constraints
$x^T e = 0$ and $x^T x = 1$. Intuitively, because of the terms $s_{ij}(x_i - x_j)^2$, minimizing
$F_S(x)$ results in mapping “similar” objects to “near-by” values. The two constraints
have a simple interpretation: $x^T e = 0$ requires the vector $x$ to be orthogonal to the
trivial solution vector $e$, i.e., to have zero mean, and $x^T x = 1$ amounts to fixing the
scale of the solution, i.e., fixing the variance of $x$.

Our method uses Fiedler vectors to order the rows and the columns of a data set
$A$. The idea is to consider each row as an “object” and define the similarity matrix
$S = (s_{ij})$ for pairs of rows. Two natural definitions of row similarity are the Hamming
similarity and the dot-product similarity. The Hamming similarity $h_{ij}$ between rows $i$ and
$j$ is the number of common values, while the dot-product similarity $c_{ij}$ is defined to
be the number of common 1s. Both similarity definitions are used in our experiments.
The method computes the row-row similarity matrix $S$ for the data matrix $A$, and then
it computes the Fiedler vector of the Laplacian $L_S$. The rows are ordered on the basis
of their Fiedler-vector coordinates. The columns are ordered with the same method,
independently of the rows.
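
The row-ordering step can be sketched in plain Python as follows. The eigenvector computation used here, power iteration on a shifted Laplacian with the all-ones direction projected out, is a stand-in chosen to keep the example dependency-free; the text does not say which eigensolver the authors used, and a library routine such as `numpy.linalg.eigh` would be the usual choice in practice. All function names are assumptions of the sketch, which also assumes nonnegative similarities and at least two rows.

```python
import math


def hamming_similarity(u, v):
    """Number of positions where two 0-1 rows agree."""
    return sum(1 for x, y in zip(u, v) if x == y)


def dot_similarity(u, v):
    """Number of positions where both 0-1 rows have a 1."""
    return sum(x * y for x, y in zip(u, v))


def laplacian(S):
    """L = D - S, where D holds the row sums of S on its diagonal."""
    n = len(S)
    return [[(sum(S[i]) if i == j else 0) - S[i][j] for j in range(n)]
            for i in range(n)]


def fiedler_vector(L, iters=500):
    """Approximate Fiedler vector by power iteration on shift*I - L."""
    n = len(L)
    # For nonnegative similarities, 2 * max diagonal entry bounds the largest
    # eigenvalue of L (Gershgorin), so shift*I - L is positive definite.
    shift = 2.0 * max(L[i][i] for i in range(n)) + 1.0
    x = [float(i) for i in range(n)]  # arbitrary deterministic start
    for _ in range(iters):
        y = [shift * x[i] - sum(L[i][j] * x[j] for j in range(n))
             for i in range(n)]
        mean = sum(y) / n
        y = [v - mean for v in y]                  # enforce x^T e = 0
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]                  # enforce x^T x = 1
    return x


def spectral_order(rows, similarity=dot_similarity):
    """Row indices sorted by their Fiedler-vector coordinates."""
    S = [[similarity(u, v) for v in rows] for u in rows]
    x = fiedler_vector(laplacian(S))
    return sorted(range(len(rows)), key=lambda i: x[i])
```

Applying `spectral_order` to the rows and, separately, to the columns (the rows of the transposed matrix) gives the two orderings; rows that belong to the same combinatorial tile tend to receive nearby coordinates and therefore end up adjacent.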

Next we will show that for a data set generated from a simple combinatorial tiling
model, the spectral algorithm will discover the correct structure. We begin with some
definitions. An $n \times n$ matrix $S$ has $(k, d, s_1, s_2)$-block structure if the $n$ indices of $S$ can
be partitioned in $k$ blocks $B_1, \ldots, B_k$ as follows: (i) the size of each block is greater
than $d$, (ii) for all $i, j \in B_l$ we have $s_{ij} = \bar{s}_l > s_1$, i.e., the value for indices within
a block is a constant greater than $s_1$, (iii) for all $i \in B_l$ and $j \in B_t$ with $l \neq t$ we have
$s_{ij} = \bar{s}_{lt} < s_2$, i.e., the value $s_{ij}$ for indices in different blocks is a constant smaller
than $s_2$. We say that a vector $x$ respects the structure of a $(k, d, s_1, s_2)$-block matrix $S$ if
for every triple of indices $(i, j, h)$ with $x_i < x_j < x_h$ it cannot be the case that $i \in B_l$,
$j \in B_t$, $h \in B_g$, and $l = g \neq t$.
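
The definition of a vector respecting the block structure translates directly into a brute-force check. The function name and the encoding of the partition as a list `block` (with `block[i]` the index of the block containing object `i`) are assumptions of this sketch.

```python
def respects(x, block):
    """True if no triple with x[i] < x[j] < x[h] has i and h in one block
    and j in a different block (direct O(n^3) check of the definition)."""
    n = len(x)
    for i in range(n):
        for j in range(n):
            for h in range(n):
                if x[i] < x[j] < x[h] and block[i] == block[h] != block[j]:
                    return False
    return True
```

Because the inequalities are strict, objects that are mapped to exactly the same value never violate the condition. The structure can still be lost once such ties are broken arbitrarily to obtain an ordering.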

Lemma 2. Let $S$ be an $n \times n$ object-similarity matrix that has $(2, d, s_1, s_2)$-block
structure. Consider $S' = S + E$, where $E$ is a symmetric matrix. Then the Fiedler vector of
$L_{S'}$ respects the structure of $S$, provided that $\|L_E\| = o(d(s_1 - s_2))$, where $\|L_E\|$ is the
norm of the matrix $L_E$.

The situation is more complex when the similarity matrix has more than 2 blocks, as
the associated Fiedler vectors form a subspace of dimension greater than 1. For
example, consider a matrix with 3 blocks. One Fiedler solution assigns three distinct
values to the objects: one value for all objects within the same block. A different Fiedler
solution uses only two distinct values: one value for objects in the first and second blocks
and one value for objects in the third block. Note that Lemma 2 does not hold for the
second solution, because if we break ties arbitrarily, most likely the objects of the first
two blocks will not respect the block structure of the matrix.

To address the problem that in a Fiedler solution more than one block might be
mapped to the same value, we have used a recursive application of the spectral
algorithm: we first divide the coordinates of the Fiedler vector in two groups so that the sum



